A Comparison of Item Response Theory
And Observed Score DIF Detection Measures 
For the Graded Response Model 



Abstract 

This paper provides a review of procedures for detection of differential item functioning 
(DIF) for item response theory (IRT) and observed score methods for the graded response 
model. In addition, data from a test anxiety scale were analyzed to examine the congruence 
among these procedures. Results indicated stronger agreement within IRT methods and 
within observed score methods than between these two sets of DIF detection methods. A 
discussion is included focusing on reasons for these similarities and differences. 



Key words: area measures, chi-square test, differential item functioning, generalized Mantel- 
Haenszel test, graded response model, item response theory, likelihood ratio test, Mantel test, 
simultaneous item bias test. 



Introduction 



Graded response items are particularly useful for test items in which examinees' answers are
not simply scored as correct or incorrect. As on any test, however, items which function 
differently in different groups need to be detected and, if necessary, removed, because they 
present a threat to the validity of the test. Although a number of methods for detection of 
such items have been developed for graded response items, either based on item response 
theory (IRT) or based on observed scores, very little research has compared results from 
these two sets of methods. 

One problem which faces developers of tests using polytomous response items is that 
the different DIF detection indices tend to identify different sets of items on the test as 
functioning differentially (e.g., Ankenmann, Witt, & Dunbar, 1996; Chang, Mazzeo, & 
Roussos, 1996; Kim, Cohen, & Baker, 1996; Welch & Hoover, 1993; Zwick, Donoghue,
& Grima, 1993). Given such a scenario, it can be difficult for one to determine which, if any,
DIF indices should be used. In this paper, we provide a review of IRT and observed score 
methods for detecting DIF in graded response items with an eye toward examining what is 
measured by each index. We then provide a comparison of the procedures reviewed using a 
set of graded response test data. 

The Graded Response Model and DIF 

In the context of dichotomous IRT models, an item is said to be functioning differentially
when the probability of a correct response to the item is different for examinees at the
same ability level but from different groups (Pine, 1977). The presence of such items on a
test indicates that examinees at the same underlying $\theta$ may exhibit systematically different
patterns of item responses. In this section, we describe the graded response model under
IRT (Samejima, 1969, 1972) and methods for detecting DIF items in that model.

The item response function (IRF) is the basic building block of IRT. For a dichotomously
scored item, the IRF is usually taken to refer to that function which characterizes the
relationship between the probability of a correct response to an item and examinee trait
level $\theta$. There are, however, two IRFs for a dichotomous item, one for the correct response
and one for the incorrect response.

The Graded Response Model



Samejima (1969, 1972) proposed a graded response model under IRT in which the category
response function, $P_{jk}(\theta)$, describes the probability of response $k$ to item $j$ as a function of
$\theta$. For an item with $K_j$ categories, $P_{jk}(\theta)$ is defined as

\[
P_{jk}(\theta) =
\begin{cases}
1 - P^*_{j1}(\theta) & \text{when } k = 1 \\
P^*_{j(k-1)}(\theta) - P^*_{jk}(\theta) & \text{when } k = 2, \ldots, (K_j - 1) \\
P^*_{j(K_j-1)}(\theta) & \text{when } k = K_j,
\end{cases}
\tag{1}
\]

where $P^*_{jk}(\theta)$, for $k = 1, \ldots, (K_j - 1)$, is the cumulative category response function
given by

\[
P^*_{jk}(\theta) = \left\{ 1 + \exp\left[ -\alpha_j (\theta - \beta_{jk}) \right] \right\}^{-1},
\tag{2}
\]

where $\alpha_j$ is the discrimination parameter for item $j$, $\beta_{jk}$ is the location parameter of response
category $k$ for item $j$, and $\theta$ is the trait level parameter. The logistic model in Equation 2
is a homogeneous case of the general graded response model (Samejima, 1972, 1997). With
$P^*_{j0}(\theta) = 1$ and $P^*_{jK_j}(\theta) = 0$, the category response function can be succinctly written as

\[
P_{jk}(\theta) = P^*_{j(k-1)}(\theta) - P^*_{jk}(\theta).
\tag{3}
\]
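As an illustration of Equations 2 and 3, the following minimal Python sketch computes the category response probabilities for a single item. The function names and the parameter values are hypothetical and chosen only for illustration; they are not taken from the data analyzed in this paper.

```python
import math


def cumulative_prob(theta, a, b):
    """Cumulative category response function P*_jk(theta) of Equation 2."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probs(theta, a, bs):
    """Category response probabilities P_jk(theta), k = 1..K, via Equation 3.

    bs holds the K - 1 ordered location parameters. The boundary values
    P*_j0 = 1 and P*_jK = 0 are added explicitly.
    """
    cum = [1.0] + [cumulative_prob(theta, a, b) for b in bs] + [0.0]
    return [cum[k - 1] - cum[k] for k in range(1, len(cum))]


# Illustrative (hypothetical) four-category item evaluated at theta = 0.
probs = category_probs(theta=0.0, a=1.5, bs=[-1.0, 0.0, 1.0])
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
```

Because the boundary values are appended before differencing, the probabilities sum to one by construction at every value of $\theta$.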



Item True Score Functions. For a polytomously scored item such as the graded 
response item (Samejima, 1969, 1972), the item true score function describes the relationship 
between the expected value of the item score and examinee trait level. 

Baker (1992) defined the true score function for the graded response model as 

\[
T(\theta) = \sum_{j=1}^{J} \sum_{k=1}^{K_j} u_{jk} P_{jk}(\theta),
\tag{4}
\]

where $J$ is the number of items in the test and $u_{jk}$ is the weight for response category $k$
of item $j$. Weights are typically, but not necessarily, taken to be the same as the category
values. For example, the weight for category 1 would be 1, and for category 4 it would be 4.

The item true score function for a single item $j$ can be defined as

\[
T_j(\theta) = \sum_{k=1}^{K_j} u_{jk} P_{jk}(\theta).
\tag{5}
\]
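The item true score function in Equation 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the weights default to the category values 1, ..., K, and the parameter values are invented for the example.

```python
import math


def item_true_score(theta, a, bs, weights=None):
    """Item true score T_j(theta): the weighted sum of category probabilities."""
    # Cumulative functions with the boundary values P*_j0 = 1 and P*_jK = 0.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    probs = [cum[k - 1] - cum[k] for k in range(1, len(cum))]
    if weights is None:
        # Assumption: weights equal the category values 1..K.
        weights = range(1, len(probs) + 1)
    return sum(u * p for u, p in zip(weights, probs))


# Hypothetical four-category item; the true score rises from 1 toward 4.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(item_true_score(theta, a=1.5, bs=[-1.0, 0.0, 1.0]), 3))
```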

Definition of DIF. In the typical DIF study, there are two groups of examinees, the 
reference group and the focal group. For a dichotomous item under IRT, the IRF is the item 
true score function. For both dichotomous and graded response items, an item is considered
to be functioning differentially when the item true score functions in the reference and focal
groups are not equal (Cohen, Kim, & Baker, 1993). That is, a DIF item is identified when
$T_{jR}(\theta) \neq T_{jF}(\theta)$. Further, the item true score functions from the reference and focal groups
are identical if and only if the cumulative category response functions for the reference and
focal groups are equal or the sets of item parameters from the reference and focal groups are
equal. These two conditions are essentially equivalent.

Detection of DIF. The equality of sets of item parameters for graded response items 
can be tested using several different approaches. One approach, the chi-square test, is to 
compare item parameters estimated from the two groups (e.g., Cohen et al., 1993; Millsap 
& Everson, 1993). A second approach is to obtain and test area measures or distances 
between item true score functions (e.g., Cohen et al., 1993; Flowers, Oshima, & Raju, 1995; 
Raju, van der Linden, & Fleer, 1995). A third approach, the likelihood ratio (LR) test 
(Thissen, Steinberg, & Gerrard, 1986; Thissen, Steinberg, & Wainer, 1988, 1993; Wainer, 
Sireci, & Thissen, 1991), uses a likelihood ratio (Neyman & Pearson, 1928) to compare 
likelihood functions estimated from different groups in order to evaluate differences between 
item responses from the two groups. Thissen, Steinberg, and Wainer (1988) noted that the 
third approach is preferable for theoretical reasons because the first and second approaches 
may require estimates of variances and covariances of the item parameters. At the present 
time, computational difficulties impede obtaining accurate estimates of these variances and 
covariances. 

Ankenmann et al. (1996) compared power and Type I error rates of the LR test and the 
Mantel (1963) test for DIF detection under the graded response model (Samejima, 1969, 
1972). Ankenmann et al. (1996) used combined dichotomous and graded response item data 
and obtained the power and Type I error rates for a single studied graded response item in 
each data set under different sample sizes and ability conditions. The LR test was found 
to yield better power and control of Type I error than the Mantel procedure (Ankenmann 
et al., 1996). Kim and Cohen (in press) reported Type I error rates of the LR test for DIF 
detection for a graded response model with five ordered categories. Data were generated for 
a 30-item test for six combinations of sample sizes by underlying ability conditions. Type I 
error rates of the LR test were found to be within theoretically expected values at each of 
the nominal alpha levels considered. Analysis of Type I error rates for the chi-square test 
and the area measures described by Cohen et al. (1993), however, indicated mixed results
(Kim et al., 1996). Type I error control was conservative for the chi-square and the signed
area measure but poor for the unsigned area measure. The LR test of DIF under the graded 
response model seems promising, but it can be computationally quite intensive (Thissen et 
al., 1993). 



IRT Methods for DIF Detection

The Chi-Square Test. A $\chi^2$ test originally described by Lord (1980) for dichotomous IRT
models can also be used to test the hypothesis that the parameters estimated for a graded
response item are the same between reference and focal groups (Cohen et al., 1993). The $\chi^2$
statistic for the graded response model item with $K_j$ categories is computed as

\[
\chi^2_j = v_j' \, \Sigma_j^{-1} \, v_j,
\tag{6}
\]

where $v_j$ is the vector of differences between parameter estimates (i.e., $v_j = \hat{\xi}_{jR} - \hat{\xi}_{jF}$) and
$\Sigma_j^{-1}$ is the inverse of the variance-covariance matrix (i.e., $\Sigma_j = \Sigma_{jR} + \Sigma_{jF}$).

The vector of item parameter estimates for the reference group can be written as

\[
\hat{\xi}_{jR} = \left[ a_{jR}, b_{j1R}, \ldots, b_{j(K_j-1)R} \right]'
\tag{7}
\]

and the variance-covariance matrix can be written as

\[
\Sigma_{jR} =
\begin{bmatrix}
\mathrm{Var}(a_{jR}) & \mathrm{Cov}(a_{jR}, b_{j1R}) & \cdots & \mathrm{Cov}(a_{jR}, b_{j(K_j-1)R}) \\
 & \mathrm{Var}(b_{j1R}) & \cdots & \mathrm{Cov}(b_{j1R}, b_{j(K_j-1)R}) \\
 & & \ddots & \vdots \\
 & & & \mathrm{Var}(b_{j(K_j-1)R})
\end{bmatrix}.
\tag{8}
\]

The vector of item parameter estimates and the estimated variance-covariance matrix for
the focal group can be defined similarly. There are $K_j$ degrees of freedom for this extension
of Lord's $\chi^2$ for a graded response model with $K_j$ categories.
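The quadratic form in Equation 6 can be illustrated with a short, dependency-free Python sketch. The helper `solve` and the function `lord_chi_square` are hypothetical names introduced here, the estimates and covariance matrices are invented, and the focal group estimates are assumed to have already been placed on the reference group metric.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def lord_chi_square(xi_ref, xi_foc, cov_ref, cov_foc):
    """Return (chi-square, df) for Equation 6."""
    v = [r - f for r, f in zip(xi_ref, xi_foc)]
    sigma = [[cr + cf for cr, cf in zip(row_r, row_f)]
             for row_r, row_f in zip(cov_ref, cov_foc)]
    w = solve(sigma, v)  # w = Sigma^{-1} v, without forming the inverse
    return sum(vi * wi for vi, wi in zip(v, w)), len(v)


# Hypothetical four-category item: parameters are (a, b1, b2, b3).
cov = [[0.02, 0, 0, 0], [0, 0.03, 0, 0], [0, 0, 0.02, 0], [0, 0, 0, 0.04]]
chi2, df = lord_chi_square([1.2, -1.0, 0.0, 1.0], [1.0, -0.8, 0.1, 1.3], cov, cov)
print(round(chi2, 3), df)
```

Solving the linear system rather than inverting $\Sigma_j$ explicitly is a numerical convenience of this sketch, not something prescribed by the procedure itself.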

The Signed Area. Raju (1988, 1990) developed a test of the signed area between item
response functions for dichotomous models. An extension of this test for graded response
items (Cohen et al., 1993) is given below. Let

\[
\hat{T}_{jR}(\theta) = \sum_{k=1}^{K_j} u_{jk} \hat{P}_{jkR}(\theta)
\tag{9}
\]

be the estimate of the item true score function for item $j$ in the reference group and let

\[
\hat{T}_{jF}(\theta) = \sum_{k=1}^{K_j} u_{jk} \hat{P}_{jkF}(\theta)
\tag{10}
\]

be the estimate of the item true score function in the focal group, where $\hat{P}_{jkR}(\theta)$ and
$\hat{P}_{jkF}(\theta)$ are the estimates of the category response functions for the reference and
focal groups, respectively.

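According to Cohen et al. (1993), the signed area $S_j$ between the two item true score
functions is obtained as

\[
S_j = \int_{-\infty}^{\infty} \left[ \hat{T}_{jR}(\theta) - \hat{T}_{jF}(\theta) \right] d\theta
    = \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right] \left( b_{jkF} - b_{jkR} \right),
\tag{11}
\]

where $b_{jkR}$ and $b_{jkF}$ are the estimates of $\beta_{jkR}$ and $\beta_{jkF}$, respectively. The estimated variance
of $S_j$ is defined as

\[
\begin{aligned}
\mathrm{Var}(S_j) = {} & \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]^2 \mathrm{Var}(b_{jkF}) \\
& + \sum_{k=1}^{K_j-1} \sum_{l=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right] \left[ u_{j(l+1)} - u_{jl} \right] \mathrm{Cov}(b_{jkF}, b_{jlF}) \\
& + \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]^2 \mathrm{Var}(b_{jkR}) \\
& + \sum_{k=1}^{K_j-1} \sum_{l=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right] \left[ u_{j(l+1)} - u_{jl} \right] \mathrm{Cov}(b_{jkR}, b_{jlR}),
\end{aligned}
\tag{12}
\]

where $k \neq l$ in the double sums.

The test statistic $Z(S_j)$ can be written as

\[
Z(S_j) = \frac{S_j}{\sqrt{\mathrm{Var}(S_j)}}
\tag{13}
\]

and is based on the assumption that the observed signed areas $S_j$ are normally distributed
with mean 0 and variance given in Equation 12.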
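Equations 11 through 13 can be sketched as follows. The function names and numerical values are hypothetical; the covariance matrices passed in are assumed to be those of the location estimates only, already expressed on a common metric. Summing over all pairs (k, l), including k = l, reproduces the variance and covariance terms of Equation 12 in a single double sum.

```python
import math


def _steps(weights):
    """Differences u_j(k+1) - u_jk between adjacent category weights."""
    return [weights[k + 1] - weights[k] for k in range(len(weights) - 1)]


def signed_area(weights, b_ref, b_foc):
    """Signed area S_j of Equation 11."""
    return sum(s * (bf - br) for s, br, bf in zip(_steps(weights), b_ref, b_foc))


def signed_area_variance(weights, cov_b_ref, cov_b_foc):
    """Var(S_j) of Equation 12; diagonal entries supply the variance terms."""
    steps = _steps(weights)
    n = len(steps)
    return sum(steps[k] * steps[l] * (cov_b_foc[k][l] + cov_b_ref[k][l])
               for k in range(n) for l in range(n))


def signed_area_z(weights, b_ref, b_foc, cov_b_ref, cov_b_foc):
    """Z(S_j) of Equation 13."""
    s = signed_area(weights, b_ref, b_foc)
    return s / math.sqrt(signed_area_variance(weights, cov_b_ref, cov_b_foc))


# Hypothetical four-category item with uncorrelated location estimates.
cov_b = [[0.03, 0, 0], [0, 0.02, 0], [0, 0, 0.04]]
print(round(signed_area_z([1, 2, 3, 4], [-1.0, 0.0, 1.0], [-0.8, 0.1, 1.3],
                          cov_b, cov_b), 3))
```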

The Unsigned Area. Raju (1988, 1990) also developed an unsigned area test for
the difference between item response functions for dichotomous items. Cohen et al. (1993)
showed that the unsigned area, $U_j$, between the two item true score functions is obtained as

\[
U_j = \int_{-\infty}^{\infty} \left| \hat{T}_{jR}(\theta) - \hat{T}_{jF}(\theta) \right| d\theta.
\tag{14}
\]

Expressing $U_j$ in terms of cumulative category response functions gives

\[
U_j = \int_{-\infty}^{\infty} \left| \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]
      \left[ \hat{P}^*_{jkR}(\theta) - \hat{P}^*_{jkF}(\theta) \right] \right| d\theta.
\tag{15}
\]

If either $\hat{T}_{jR}(\theta) > \hat{T}_{jF}(\theta)$ or $\hat{T}_{jR}(\theta) < \hat{T}_{jF}(\theta)$ for all $\theta$, then

\[
U_j = |S_j|.
\tag{16}
\]

Assuming that the $S_j$ are normally distributed with mean 0 and variance as given in
Equation 12, the expected value of $U_j$ is

\[
E(U_j) = \sqrt{\frac{2 \, \mathrm{Var}(S_j)}{\pi}}
\tag{17}
\]

and the variance of $U_j$ is

\[
\mathrm{Var}(U_j) = \mathrm{Var}(S_j) \left( 1 - \frac{2}{\pi} \right)
\tag{18}
\]

(Hogg & Craig, 1978). It should be noted that the assumption of normality for $U_j$ may not be
justified (Raju, 1990). Equation 13 also provides a test of the null hypothesis of no DIF only
if either $\hat{T}_{jR}(\theta) > \hat{T}_{jF}(\theta)$ or $\hat{T}_{jR}(\theta) < \hat{T}_{jF}(\theta)$ for all $\theta$. If neither condition
holds for all $\theta$, $U_j$ may not have a closed form. In such a case, no
statistical test is yet available for the null hypothesis. Even so, it still may be of interest to
examine the size of $U_j$ and to test its significance with the variance given in Equation 18.

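The following approximation may be used for $U_j$: Select two points $\theta_L$ and $\theta_U$ such that
$\theta_L < \theta_U$ and divide the range into $N$ intervals. The area $U_j$ then is estimated using the
trapezoidal approximation of the bounded unsigned area (Burden & Faires, 1985) as

\[
\begin{aligned}
U_j = {} & \sum_{i=1}^{N} \left| \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]
          \left[ \hat{P}^*_{jkR}(\theta_i) - \hat{P}^*_{jkF}(\theta_i) \right] \right| \Delta\theta \\
& + \frac{1}{2} \left| \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]
          \left[ \hat{P}^*_{jkR}(\theta_L) - \hat{P}^*_{jkF}(\theta_L) \right] \right| \Delta\theta \\
& - \frac{1}{2} \left| \sum_{k=1}^{K_j-1} \left[ u_{j(k+1)} - u_{jk} \right]
          \left[ \hat{P}^*_{jkR}(\theta_U) - \hat{P}^*_{jkF}(\theta_U) \right] \right| \Delta\theta,
\end{aligned}
\tag{19}
\]

where $\Delta\theta = (\theta_U - \theta_L)/N$ and $\theta_i = \theta_L + i \, \Delta\theta$.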
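The trapezoidal approximation can be sketched in Python as below. The function names, the integration bounds of -10 and 10, and the number of intervals are assumptions made for this illustration, as are the item parameter values.

```python
import math


def true_score_gap(theta, weights, a_ref, b_ref, a_foc, b_foc):
    """Sum over k of [u_j(k+1) - u_jk][P*_jkR(theta) - P*_jkF(theta)]."""
    def cum(a, b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return sum((weights[k + 1] - weights[k]) * (cum(a_ref, b_ref[k]) - cum(a_foc, b_foc[k]))
               for k in range(len(b_ref)))


def unsigned_area(weights, a_ref, b_ref, a_foc, b_foc,
                  lower=-10.0, upper=10.0, n=2000):
    """Trapezoidal estimate of the bounded unsigned area between true score functions."""
    step = (upper - lower) / n

    def f(theta):
        return abs(true_score_gap(theta, weights, a_ref, b_ref, a_foc, b_foc))

    # Sum over theta_1..theta_N, then correct the two end points so that
    # each receives weight one half, as in the trapezoidal rule.
    total = sum(f(lower + i * step) for i in range(1, n + 1))
    total += 0.5 * f(lower) - 0.5 * f(upper)
    return total * step


# Hypothetical uniform DIF: every focal location is shifted by 0.5, so the
# true score functions never cross and the area should be close to |S_j| = 1.5.
print(round(unsigned_area([1, 2, 3, 4], 1.5, [-1.0, 0.0, 1.0],
                          1.5, [-0.5, 0.5, 1.5]), 4))
```

When the two true score functions cross, the same routine still returns an area, but that value no longer equals the absolute value of the signed area.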

The Likelihood Ratio Test. The LR test for DIF described by Thissen et al. (1986,
1988, 1993) compares two different models: a compact model and an augmented model.
The LR test statistic, $G^2$, is the difference between the values of $-2$ times the log likelihood
for the compact model ($-2 \log L_C$) and $-2$ times the log likelihood for the augmented model
($-2 \log L_A$). The values of the quantity $-2 \log L$ can be obtained from the output of the
calibration runs from the computer program MULTILOG (Thissen, 1991) and are based on
the results over the entire data set following marginal maximum likelihood estimation.

Let $y_j$ be the polytomous score for item $j$ (i.e., $y_j = 1, 2, \ldots, k, \ldots, K_j$) and let

\[
u_{jk} =
\begin{cases}
1 & \text{if } y_j = k \\
0 & \text{otherwise}
\end{cases}
\tag{20}
\]

be the indicator variable for item $j$. Without loss of generality, it can be assumed that all
items in the test have the same number of categories $K$. The category response function
describes the probability that $y_j = k$ at ability level $\theta$ and is defined as

\[
\mathrm{Prob}\{ y_j = k \mid \theta, \xi_j \} = P_{jk}(\theta) = \prod_{k=1}^{K} P_{jk}(\theta)^{u_{jk}},
\tag{21}
\]

where $\xi_j$ represents the vector of item parameters. Under the assumption of local
independence, the conditional probability, given $\theta$, of a particular response vector or $l$th
response pattern $y_l = (y_1, y_2, \ldots, y_J)$ can be written as

\[
P(y_l \mid \theta) = \prod_{j=1}^{J} \prod_{k=1}^{K} P_{jk}(\theta)^{u_{jk}},
\tag{22}
\]

where $J$ is the total number of items in the test. The marginalized probability of a response
pattern $y_l = (y_1, y_2, \ldots, y_J)$ can be written as

\[
P(y_l) = \int P(y_l \mid \theta) \, \pi(\theta \mid \tau) \, d\theta,
\tag{23}
\]

where $\pi(\theta \mid \tau)$ is the ability distribution and $\tau$ are the population ability parameters (see
Bock & Aitkin, 1981; Thissen et al., 1986). The distribution of ability in the usual IRT
model is Gaussian, and, hence, $\tau$ contains $\mu$ and $\sigma^2$.

To obtain the marginal likelihood, the item response data are summarized as raw
counts of the number of examinees giving each particular response pattern across all items.
The counts for group $g$ are denoted by $r_g(y_l)$ and fill the cells of a $K^J$ contingency table
of all possible response patterns for each group. The marginalized probability of observing
an examinee in group $g$ with a response pattern $y_l$ is

\[
P_g(y_l) = \int P(y_l \mid \theta) \, \pi(\theta \mid \tau_g) \, d\theta.
\tag{24}
\]

The likelihood for the complete set of $K^J$ tables for all the groups is proportional to

\[
\prod_{g=1}^{G} \prod_{l=1}^{K^J} P_g(y_l)^{r_g(y_l)},
\tag{25}
\]

where $G$ is the number of groups. The marginal maximum likelihood estimates of the
parameters of interest can be obtained using the algorithm described in Bock and Aitkin
(1981). Under its default options, the computer program MULTILOG sets the location and
scale of $\theta$ arbitrarily by fixing $\mu_R = 0$ and $\sigma^2_R = 1$ for the reference group. In addition,
a default in MULTILOG also imposes the constraint $\sigma^2_R = \sigma^2_F$. Then,

\[
-2 \log L = -2 \sum_{g=1}^{G} \sum_{l=1}^{K^J} r_g(y_l) \log \left[ \frac{N_g P_g(y_l)}{r_g(y_l)} \right],
\tag{26}
\]

with $N_g = \sum_l r_g(y_l)$ (i.e., the number of examinees in group $g$) and $P_g(y_l)$ computed from
the marginal maximum likelihood estimates of the parameters. [See Bishop, Fienberg, and
Holland (1975) for an extensive discussion of the use of the likelihood ratio statistic in the
context of model-fitting for contingency tables.]

In the compact model, the item parameters are assumed to be the same for both the 
reference and focal groups. MULTILOG has an option that permits equality constraints to 
be placed on items for estimation of the compact model. In the augmented model, item 
parameters for all items except the studied item are constrained to be equal in both the 
reference and focal groups. These constrained items are referred to as the common or anchor 
set. 

The LR test statistic can be written as

\[
G^2 = -2 \log L_C - (-2 \log L_A)
\tag{27}
\]

and is distributed as a $\chi^2$ under the null hypothesis with degrees of freedom equal to the
difference in the number of parameters estimated in the compact and augmented models
(Rao, 1973). When a graded response item with four categories is tested, $G^2$ is distributed
as a $\chi^2$ with four degrees of freedom.
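Given the two values of $-2 \log L$ reported by the calibration runs, Equation 27 and its p-value can be computed as in the sketch below. The function names and the two deviance values are hypothetical. The tail probability uses the closed-form chi-square survival function that holds only for even degrees of freedom, which covers the four-category case discussed above; a general-purpose routine would be needed for odd degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability, valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # accumulates (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def lr_test(neg2loglik_compact, neg2loglik_augmented, df):
    """G^2 of Equation 27 and its p-value under the null hypothesis."""
    g2 = neg2loglik_compact - neg2loglik_augmented
    return g2, chi2_sf_even_df(g2, df)


# Hypothetical -2 log L values for a four-category studied item (df = 4).
g2, p = lr_test(20150.3, 20138.1, df=4)
print(round(g2, 2), round(p, 4))
```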

Observed Score Methods for DIF Detection 

Two extensions of the Mantel-Haenszel test for dichotomous models (Mantel & Haenszel, 
1959) have been proposed by Zwick et al. (1993) for graded response items: the Mantel
(1963) test and the generalized Mantel-Haenszel test (Mantel & Haenszel, 1959). The Mantel 
test assumes that item responses are ordered, whereas the generalized Mantel-Haenszel test 
assumes that item responses are nominal. The assumption underlying the Mantel test would 
appear to be theoretically more consistent with the ordered nature of scores used for graded 
response items. Chang et al. (1996) have described an extension of the simultaneous item 
bias test (SIBTEST) of Shealy and Stout (1993) for use with polytomous models. 

The Mantel Test. Mantel (1963) proposed a test of conditional independence for the
case of $K$ ordered categories (see also Agresti, 1990, pp. 283-284). Application of the method
in the DIF context involves assigning ordered index numbers to the response categories and
then comparing the item means for examinees of the reference and focal groups who have
been matched on a measure of proficiency. Ankenmann et al. (1996), Chang et al. (1996),
Welch and Hoover (1993), Welch and Miller (1995), and Zwick et al. (1993) investigated this
statistic in their studies of DIF methods for polytomously scored items.

In a DIF study of an item with $K$ ordered response categories, there will be a separate
$2 \times K$ contingency table for each level of the matching variable. The data can be arranged
into a full $2 \times K \times L$ contingency table, where $L$ is the number of levels of the matching
or stratification variable. The total raw score is often used as the matching variable in the
Mantel test. For the $l$th level of the matching variable, for example, a $2 \times K$ contingency
table can be constructed to contain the data as shown in Table 1. The values, $Y_1, \ldots, Y_K$,
represent the scores that can be obtained on the item. The item scores are typically, but
not necessarily, the natural numbers (i.e., $1, \ldots, K$). The values of $A_{kl}$ and $B_{kl}$ denote the
number of focal and reference group examinees, respectively, who are at the $l$th level of the
matching variable and received an item score of $Y_k$. The marginal total of the focal group
at the $l$th level is denoted as $N_{Fl}$, and that of the reference group as $N_{Rl}$. The total number
of focal and reference group members with an item score $Y_k$ at the $l$th level of the matching
variable is denoted by $M_{kl}$. The total number of examinees at the $l$th level of the matching
variable is denoted by $T_l$.



Insert Table 1 about here



Given the marginal totals in each level of the matching variable, under the assumption of
conditional independence of the item score variable $Y$ and the group membership variable,
the observed sum of the weighted scores

\[
\sum_{k=1}^{K} A_{kl} Y_k
\tag{28}
\]

has its expectation and variance defined as

\[
E\left( \sum_{k=1}^{K} A_{kl} Y_k \right) = \frac{N_{Fl}}{T_l} \sum_{k=1}^{K} M_{kl} Y_k
\tag{29}
\]

and

\[
\mathrm{Var}\left( \sum_{k=1}^{K} A_{kl} Y_k \right)
= \frac{N_{Rl} N_{Fl}}{T_l^2 (T_l - 1)}
\left[ T_l \sum_{k=1}^{K} M_{kl} Y_k^2 - \left( \sum_{k=1}^{K} M_{kl} Y_k \right)^2 \right].
\tag{30}
\]



When a dichotomous variable, say $X$, is used for the group membership variable (e.g., $X_F = 1$
and $X_R = 0$), then the value from the single contingency table is

\[
\frac{\left[ \sum_{k=1}^{K} A_{kl} Y_k - E\left( \sum_{k=1}^{K} A_{kl} Y_k \right) \right]^2}
     {\mathrm{Var}\left( \sum_{k=1}^{K} A_{kl} Y_k \right)}
\tag{31}
\]

and is the same as the squared point biserial correlation between $X$ and $Y$, multiplied by
the sample size minus one ($T_l - 1$) for the $l$th level of the matching variable. Under the
null hypothesis of conditional independence, either the point biserial correlation or the value
from Equation 31 should be close to zero.

To summarize the association from all $L$ levels of the matching variable, Mantel (1963)
proposed the statistic

\[
M^2 = \frac{\left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{kl} Y_k
      - \sum_{l=1}^{L} E\left( \sum_{k=1}^{K} A_{kl} Y_k \right) \right]^2}
     {\sum_{l=1}^{L} \mathrm{Var}\left( \sum_{k=1}^{K} A_{kl} Y_k \right)}.
\tag{32}
\]

The expected value and the variance are obtained under the assumption of the conditional
independence between the item score variable and the group membership variable in each
level of the matching variable. Under the null hypothesis of no association, $H_0$, the test
statistic $M^2$ is distributed as a chi-square with one degree of freedom provided that the total
sample size is large. For dichotomous items, this test statistic is identical to the Mantel-
Haenszel (1959) statistic without the continuity correction. In DIF applications, rejection
of $H_0$ indicates that examinees in the focal and reference groups who are similar in overall
proficiency with respect to the matching variable tend to differ in their average performance
on the studied item.
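The Mantel statistic of Equation 32 can be sketched directly from the stratified counts. In this minimal illustration the function name and the data layout (one pair of focal and reference count lists per level of the matching variable) are assumptions, the counts are invented, and levels that contain only one group are skipped because they contribute no variance.

```python
def mantel_chi_square(strata, scores):
    """Mantel (1963) statistic M^2 for ordered response categories.

    strata: list of (focal_counts, reference_counts) per matching level,
            each a list of K counts (A_kl and B_kl).
    scores: the K item scores Y_1..Y_K.
    """
    observed = expected = variance = 0.0
    for focal, reference in strata:
        n_f, n_r = sum(focal), sum(reference)
        total = n_f + n_r
        if n_f == 0 or n_r == 0 or total < 2:
            continue  # assumption: such levels are dropped from the sums
        margins = [a + b for a, b in zip(focal, reference)]      # M_kl
        sum_y = sum(m * y for m, y in zip(margins, scores))
        sum_y2 = sum(m * y * y for m, y in zip(margins, scores))
        observed += sum(a * y for a, y in zip(focal, scores))     # Eq. 28
        expected += n_f * sum_y / total                           # Eq. 29
        variance += (n_r * n_f * (total * sum_y2 - sum_y ** 2)
                     / (total ** 2 * (total - 1)))                # Eq. 30
    return (observed - expected) ** 2 / variance


# Hypothetical counts for a four-category item at two matching levels.
strata = [([8, 6, 4, 2], [3, 5, 6, 6]),
          ([2, 5, 7, 6], [1, 3, 8, 8])]
print(round(mantel_chi_square(strata, scores=[1, 2, 3, 4]), 3))
```

The resulting value is referred to a chi-square distribution with one degree of freedom.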

The Generalized Mantel-Haenszel Test. Mantel and Haenszel (1959) described 
a generalized extension of the ordinary Mantel-Haenszel statistic to the case of K > 2 
response categories [see also Agresti (1990, pp. 234-235) and Somes (1986)]. The generalized 
statistic tests the conditional independence for an unordered group variable and K response 
categories. Application of the method in the DIF context involves assigning nominal numbers
to the response categories and then comparing the vectors of the item responses for examinees
of the reference and focal groups who have been matched on a measure of proficiency. 

Using the notation in Table 1, assuming fixed marginal totals in each level of the matching
variable, the observed vector of the number of examinees for $Y_1, \ldots, Y_{K-1}$ of the focal group
is

\[
a_l = \left( A_{1l}, \ldots, A_{kl}, \ldots, A_{(K-1)l} \right)',
\tag{33}
\]

which has expectation and variance

\[
E(a_l) = N_{Fl} \, m_l / T_l
\tag{34}
\]

and

\[
V_l = \frac{N_{Rl} N_{Fl} \left[ T_l \, \mathrm{diag}(m_l) - m_l m_l' \right]}{T_l^2 (T_l - 1)},
\tag{35}
\]

where

\[
m_l = \left( M_{1l}, \ldots, M_{kl}, \ldots, M_{(K-1)l} \right)'.
\tag{36}
\]

The expected value and the variance are based on the conditional independence of the item
score variable and the group membership variable. As noted in Agresti (1990), the value

\[
\left[ a_l - E(a_l) \right]' V_l^{-1} \left[ a_l - E(a_l) \right]
\tag{37}
\]

is the Pearson (1900, 1922) chi-square statistic for testing independence, multiplied by a
factor $(T_l - 1)/T_l$.

The generalized Mantel-Haenszel statistic summarizes the association from all $L$ levels
of the matching variable and is defined as

\[
Q^2 = \left[ \sum_{l=1}^{L} a_l - \sum_{l=1}^{L} E(a_l) \right]'
      \left[ \sum_{l=1}^{L} V_l \right]^{-1}
      \left[ \sum_{l=1}^{L} a_l - \sum_{l=1}^{L} E(a_l) \right].
\tag{38}
\]

If we let $a = \sum_{l=1}^{L} a_l$, $e = \sum_{l=1}^{L} E(a_l)$, and $V = \sum_{l=1}^{L} V_l$, then $Q^2$ can be written in quadratic form
as

\[
Q^2 = (a - e)' \, V^{-1} (a - e).
\tag{39}
\]

Under the assumption of conditional independence, the test statistic $Q^2$ has a large-sample
chi-square distribution with $K - 1$ degrees of freedom when two groups are used. In the case of
dichotomous items, this statistic is identical to the Mantel-Haenszel (1959) statistic without
the continuity correction. In DIF applications, rejection of $H_0$ indicates that examinees of
the focal and reference groups who are similar in overall proficiency tend to differ in their
performance on the studied item.
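A dependency-free Python sketch of Equations 33 through 39 follows. The function names, the small linear solver, and the example counts are all hypothetical; the data layout matches one pair of focal and reference count lists per level of the matching variable, and the last response category is dropped to form the vectors of length K - 1.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def generalized_mh(strata):
    """Return (Q^2, df) for a list of (focal_counts, reference_counts) pairs."""
    dim = len(strata[0][0]) - 1                  # K - 1 categories are used
    diff = [0.0] * dim                           # accumulates a - e
    cov = [[0.0] * dim for _ in range(dim)]      # accumulates V
    for focal, reference in strata:
        n_f, n_r = sum(focal), sum(reference)
        total = n_f + n_r
        if total < 2:
            continue
        m = [a + b for a, b in zip(focal, reference)][:dim]
        scale = n_r * n_f / (total ** 2 * (total - 1))
        for i in range(dim):
            diff[i] += focal[i] - n_f * m[i] / total            # Eqs. 33-34
            for j in range(dim):
                diag = total * m[i] if i == j else 0.0
                cov[i][j] += scale * (diag - m[i] * m[j])       # Eq. 35
    w = solve(cov, diff)                                        # V^{-1}(a - e)
    return sum(d * x for d, x in zip(diff, w)), dim


# Hypothetical counts for a three-category item at two matching levels.
q2, df = generalized_mh([([10, 20, 30], [30, 20, 10]),
                         ([12, 18, 20], [15, 20, 15])])
print(round(q2, 3), df)
```

For a single level, the value returned equals the Pearson chi-square multiplied by $(T_l - 1)/T_l$, which provides a convenient check on the implementation.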

The SIBTEST for Polytomous Items. Chang, Mazzeo, and Roussos (1996) describe
an extension of the SIBTEST for dichotomous items (Shealy & Stout, 1993) to polytomous
items such as graded response items. The amount of DIF measured by this method is

\[
B_j(\theta) = E_R(Y_j \mid \theta) - E_F(Y_j \mid \theta),
\tag{40}
\]

where, within each group,

\[
E(Y_j \mid \theta) = \sum_{k=1}^{K_j} Y_{jk} P_{jk}(\theta),
\tag{41}
\]

with $Y_{jk}$ denoting the score for category $k$ of item $j$ and $P_{jk}(\theta)$ the category response
function in that group. $R$ and $F$ designate the reference group and the focal group,
respectively, and $Y_j$ represents the score that can be obtained on the item. The item
scores $Y_j$ are possibly, but not necessarily,
the natural numbers (i.e., $1, \ldots, K_j$). If there are the same number of categories in all items,
then, without loss of generality, we can write $K = K_j$.

A global index of DIF (Shealy & Stout, 1993) is given by

\[
\beta_j = \int B_j(\theta) \, g_F(\theta) \, d\theta,
\tag{42}
\]

where $g_F(\theta)$ is the density of $\theta$ in the focal group. This is interpreted as the expected amount
of DIF experienced by a randomly selected examinee from the focal group.

Two minor modifications to the original SIBTEST are needed to accommodate polytomous
data: (1) replacement of the number of items in the SIBTEST with the maximum test
score due to polytomous scoring and (2) modification of the matching test reliability
estimates used by Shealy and Stout in their regression correction, substituting Cronbach's
alpha for KR20 (Chang, Mazzeo, & Roussos, 1996).

The test statistic $B_j$ is defined as

\[
B_j = \frac{\hat{\beta}_j}{\mathrm{s.e.}(\hat{\beta}_j)},
\tag{43}
\]

where

\[
\hat{\beta}_j = \sum_{l=1}^{L} p_l d_l,
\tag{44}
\]

$d_l = \bar{Y}_{jRl} - \bar{Y}_{jFl}$ is the group difference in performance on the studied item for the examinees
at the $l$th level of the matching variable, $p_l$ is the proportion of the examinees at the $l$th level
of the matching variable (i.e., $p_l = N_l/N$), and

\[
\mathrm{s.e.}(\hat{\beta}_j) = \left\{ \sum_{l=1}^{L} p_l^2
\left[ \frac{\mathrm{Var}_{Rl}(Y_j)}{N_{Rl}} + \frac{\mathrm{Var}_{Fl}(Y_j)}{N_{Fl}} \right] \right\}^{1/2},
\tag{45}
\]

where $\mathrm{Var}_{Rl}(Y_j)$ and $\mathrm{Var}_{Fl}(Y_j)$ are the sample variances of the studied item scores at the
$l$th level of the matching variable for examinees in the reference and focal groups, respectively. It can
be seen that $N_l = N_{Rl} + N_{Fl}$ and $N = N_R + N_F$, where $N_R = \sum_l N_{Rl}$ and $N_F = \sum_l N_{Fl}$.
The total score for the matching variable can be obtained as

\[
X = \sum_{j=1}^{J} X_j,
\tag{46}
\]

where $J$ is the total number of items used in the matching variable and $X_j$ are the $j$th item
scores (e.g., $1, \ldots, K_j$). If we assign 1 to $K_j$ for the $j$th item scores, then $X$ will be $J$, $J + 1$,
$\ldots$, $\sum_{j=1}^{J} K_j$. In this case, the first level, $l = 1$, corresponds to $X = J$, and the highest level
$l = L$ corresponds to $X = \sum_{j=1}^{J} K_j$.
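Equations 43 through 45 can be illustrated with the sketch below. It is deliberately simplified: the regression correction of Shealy and Stout (1993) is omitted, levels with fewer than two examinees in either group are dropped, and $N$ is taken over the retained levels only. These choices, the function name, and the data are assumptions made for illustration and do not reproduce the SIBTEST program.

```python
import math
from statistics import mean, variance


def sibtest_uncorrected(ref_by_level, foc_by_level):
    """Return (beta_hat, standard error, B) without the regression correction.

    Each argument maps a matching-variable level to the list of studied-item
    scores observed for that group at that level.
    """
    levels = sorted(level for level in ref_by_level
                    if level in foc_by_level
                    and len(ref_by_level[level]) >= 2
                    and len(foc_by_level[level]) >= 2)
    n_total = sum(len(ref_by_level[level]) + len(foc_by_level[level])
                  for level in levels)
    beta_hat = 0.0
    var_beta = 0.0
    for level in levels:
        ref, foc = ref_by_level[level], foc_by_level[level]
        p_l = (len(ref) + len(foc)) / n_total                    # p_l = N_l / N
        beta_hat += p_l * (mean(ref) - mean(foc))                # Eq. 44
        var_beta += p_l ** 2 * (variance(ref) / len(ref)
                                + variance(foc) / len(foc))      # Eq. 45
    se = math.sqrt(var_beta)
    return beta_hat, se, beta_hat / se                           # Eq. 43


# Hypothetical studied-item scores at two levels of the matching variable.
ref = {5: [2, 3, 3, 4], 6: [3, 4, 4, 5]}
foc = {5: [1, 2, 2, 3], 6: [2, 3, 3, 4]}
beta_hat, se, b = sibtest_uncorrected(ref, foc)
print(round(beta_hat, 3), round(se, 3), round(b, 3))
```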



Linking and Purification 



Linking Metrics. As in the case for the dichotomous IRT models, the transformation 
or linking of the metric of the focal group to the metric of the reference group is required 
under the graded response model before DIF comparisons are made. Baker (1992) extended 
the test characteristic curve method for linking (Stocking & Lord, 1983) to the case of the 
graded response model. Recent evidence (Cohen & Kim, in press) suggests that the test 
characteristic curve method may be more accurate than the minimum chi-square method or 
mean and sigma methods. 

Linking of metrics is required only when item parameter estimates are obtained separately 
in both groups. DIF comparisons using the LR test procedure do not need to be preceded by 
linking as item parameters are estimated simultaneously in both groups. In the LR method 
described by Thissen et al. (1988, 1993), the likelihood from a compact model, in which no 
group differences are assumed to be present, is compared to that from an augmented model in 
which one or more items are examined for possible DIF. The metric of the augmented model, 
as well as the metric of the compact model, is dependent upon a set of anchor items that 
are assumed to be free of DIF. Although likelihoods obtained via simultaneous calibration 
do not require any linking transformation from one metric to another, comparing a compact 
model to an augmented model does require two separate calibrations for each comparison,
one for the compact model and one for the augmented model in which at least one item is
unconstrained in the two groups. 

Since the methods based on observed scores do not involve calibration of item
parameters, the Mantel test, the generalized Mantel-Haenszel test, and the SIBTEST do not require
linking. All three observed score methods, however, assume that there exists a matching
variable. The matching variable provides an observed score metric on which item response 
patterns are compared. 

Scale Purification. The linking required for the chi-square test and the area measures 
may be seriously affected by the presence of DIF items in the set of items used for calculation 
of the linear transformation coefficients. The results for the dichotomous IRT models indicate
that spurious identification of items as DIF or non-DIF may result from the presence of DIF
items on the test (cf. Lautenschlager & Park, 1988; Shepard, Camilli, & Williams, 1984).

Two methods, scale purification (Lord, 1980, p. 220) and iterative linking (Candell & 
Drasgow, 1988), have been recommended for dealing with this problem for the dichotomous 
IRT models. Iterative linking can be generalized to polytomous IRT models without any 
modification. Iterative linking described by Candell and Drasgow (1988) proceeds as follows: 

1. Estimate item parameters for the reference and focal groups separately. 

2. Place the focal group item parameter estimates onto the scale of the reference group. 

3. Calculate DIF indices and stop the process if no DIF items are found. 

4. Otherwise, remove the DIF items and recalculate the linking coefficients using only the 
remaining non-DIF items. 

5. Calculate DIF indices for all items (including previously identified DIF items). 

Steps 4 and 5 are continued until the same set of DIF items is identified on a subsequent 
iteration. Note that the iterative linking procedure requires item parameters be calibrated 
one time only in each group. The iterative linking procedure can be applied to the chi-square 
test and the area measures. 
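The iterative linking steps listed above can be sketched as a short loop. The function `iterative_linking` and its two callback arguments are hypothetical names introduced for this illustration: `estimate_link` stands in for whatever routine computes linking coefficients from a set of anchor items, and `flag_dif` stands in for the chosen DIF index applied to all items. The toy callbacks use a single additive constant and a fixed cutoff purely to make the example runnable. The `max_iter` guard is an added safeguard against cycling and is not part of the procedure as described.

```python
def iterative_linking(items, estimate_link, flag_dif, max_iter=20):
    """Repeat linking and DIF detection until the flagged set stabilizes."""
    all_items = set(items)
    coefficients = estimate_link(all_items)            # step 2
    flagged = set(flag_dif(coefficients))              # step 3
    for _ in range(max_iter):
        if not flagged:
            break                                      # no DIF items found
        coefficients = estimate_link(all_items - flagged)   # step 4
        new_flagged = set(flag_dif(coefficients))            # step 5
        if new_flagged == flagged:
            break                                      # same set twice: stop
        flagged = new_flagged
    return flagged, coefficients


# Toy illustration: per-item location differences between groups (invented).
diffs = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.6, 6: 1.8}


def toy_link(anchor):
    """Hypothetical linking constant: mean difference over the anchor items."""
    return sum(diffs[i] for i in anchor) / len(anchor)


def toy_flag(shift):
    """Hypothetical DIF rule: flag items far from the linking constant."""
    return {i for i in diffs if abs(diffs[i] - shift) > 0.45}


print(iterative_linking(diffs, toy_link, toy_flag))
```

In the toy data, item 5 is flagged only after item 6 has been removed from the anchor set, which is the behavior the iterative procedure is intended to capture.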

For the LR test, Thissen et al. (1988, 1993) indicate the need for excluding DIF items
from the set of items used as the internal anchor. The approach recommended by Thissen
et al. (1988, 1993) for dichotomous items is to first use the Mantel-Haenszel $\chi^2$ (Holland &
Thayer, 1988) to identify DIF items to be removed from the anchor set. Kim and Cohen
(1995) describe an iterative procedure for scale purification with the LR test.

Scale purification for the Mantel test and the generalized Mantel-Haenszel test has not 
been discussed extensively. Zwick et al. (1993) indicated that the studied item should be 
included in the matching variable. Once DIF items are identified, however, it is possible 
to remove them from the analysis and sequentially test the remaining items for presence of 
DIF. 

For the SIBTEST, Stout and Roussos (1996) offer the following scale purification steps: 

Step 1. Conduct a DIF analysis over the J items of interest. On the J runs of SIBTEST, 
each item is evaluated sequentially and the remaining J — 1 items are used to form the 
matching variable. If any DIF items are detected, those items form the Step 1 suspect 
set. 

Step 2. Conduct the second DIF analysis using the items that were not included in the 
Step 1 suspect set. If there are J' of such items, then there will be J' subsequent runs 
of SIBTEST each with J' — 1 items forming the matching variable. If any additional 
DIF items are detected, the flagged items form the Step 2 suspect set. 

Step 3. Combine the two sets of suspect items and form the Step 3 suspect set. Test each 
item sequentially in the Step 3 suspect set, one at a time. The unflagged items from 
Step 2 are used as the matching variable. All items rejected based on a prespecified 
nominal alpha level are considered to be the DIF items. 

Method 

Data 

Data from Nasser, Takahashi, and Benson (1997) were reanalyzed for purposes of this study. 
The data were obtained from participants' responses to an Arabic version of Sarason’s (1984) 
Reactions to Test (RTT) scale. The RTT scale consists of 40 Likert-type items with four 
options. The sample consisted of 421 tenth graders from two Arab high schools in the central 
district of Israel. There were 226 female students and 195 male students in the sample. The 
purpose of DIF analyses was to compare the item responses of female and male students. 
For purposes of this study, female students were treated as the reference group and male 
students as the focal group. 



Parameter Estimation and DIF Detection Procedures 



Item parameter estimates for the graded response model were obtained using marginal 
maximum likelihood estimation via the computer program MULTILOG (Thissen, 1991). 
For Lord’s $\chi_j^2$ and the two area measures, the item parameter estimates, $a_j$ and 
$b_{jk}$ ($k = 1, 2, 3$), were obtained from separate calibration runs of MULTILOG in the 
reference and focal groups. 

The computer program EQUATE 2.0 (Baker, 1993) implements the characteristic curve 
method of equating and was used to obtain the linear coefficients for linking item parameter 
estimates obtained in the reference and focal groups. The coefficients, $A$ and $B$, were then 
used in the following transformations to place the focal group item parameter estimates, $a_{jF}$ 
and $b_{jkF}$, and their estimated variances onto the metric of the reference group: 

$$a^*_{jF} = \frac{a_{jF}}{A}, \tag{47}$$

$$b^*_{jkF} = A \times b_{jkF} + B, \tag{48}$$

$$\mathrm{var}(a^*_{jF}) = \frac{\mathrm{var}(a_{jF})}{A^2}, \tag{49}$$

$$\mathrm{var}(b^*_{jkF}) = A^2 \times \mathrm{var}(b_{jkF}), \tag{50}$$

where * indicates a transformed value. Iterative linking (Candell & Drasgow, 1988) was used 
with the chi-square test and the two area measures, $Z(S_j)$ and $Z(U_j)$. 
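
As a minimal sketch of the transformations in Equations 47 to 50, the Python fragment below rescales one focal-group item. The class `GradedItem`, the function `to_reference_metric`, and the coefficient values A = 1.1 and B = -0.5 are assumptions made for illustration; they are not output from EQUATE. The item values are the focal-group estimates and standard errors for item 1 in Table 4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedItem:
    a: float            # discrimination estimate
    b: List[float]      # category boundary (location) estimates
    var_a: float        # estimated variance of a
    var_b: List[float]  # estimated variances of the b's


def to_reference_metric(item: GradedItem, A: float, B: float) -> GradedItem:
    """Apply the linear linking transformations to one focal-group item."""
    return GradedItem(
        a=item.a / A,                          # Equation 47
        b=[A * bk + B for bk in item.b],       # Equation 48
        var_a=item.var_a / A ** 2,             # Equation 49
        var_b=[A ** 2 * v for v in item.var_b],  # Equation 50
    )


# Focal-group estimates for item 1 (Table 4); variances are squared s.e.'s.
focal_item1 = GradedItem(
    a=1.34,
    b=[-2.20, 0.21, 1.21],
    var_a=0.22 ** 2,
    var_b=[0.37 ** 2, 0.17 ** 2, 0.24 ** 2],
)

# A and B below are hypothetical linking coefficients, not values from the study.
linked = to_reference_metric(focal_item1, A=1.1, B=-0.5)
print(linked)
```

Because the variances are rescaled by the square of the same coefficient that rescales the estimates, the ratio of each estimate's slope component to its standard error is unchanged by the linking.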

For the LR test (Thissen et al., 1988, 1993), the compact model was obtained by 
calibration over the combined reference and focal groups via the computer program 
MULTILOG (Thissen, 1991). MULTILOG permits constraints to be placed on the item 
parameters for estimation of the compact model. The item parameters for all internal anchor 
items in the augmented model were similarly constrained, and only the item parameters for 
the studied item were estimated independently in the reference and focal groups. 

The metric used in the likelihood ratio test is based upon the set of items contained in 
the internal anchor. If DIF items are present in the anchor, erroneous identification of items 
as DIF or non-DIF could result. In this study, we used a sequential approach to purify the 
anchor set. All DIF items were removed from subsequent anchor sets until no further DIF 
items were found. 

For the Mantel test and the generalized Mantel-Haenszel test, the same iterative 
purification procedure as used in the LR test was applied. DIF items were sequentially 
removed until no DIF items were found. Each time a DIF item was identified, it was removed 
from the matching variable for subsequent DIF comparisons. For the SIBTEST, the scale 
purification procedure recommended by Stout and Roussos (1996) was applied. Results from 
each of the DIF detection methods were compared for the initial and final iterations. 
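
One possible reading of the sequential purification used for the LR test, the Mantel test, and the generalized Mantel-Haenszel test is sketched below in Python. It is an illustration under stated assumptions rather than the exact procedure that was run: the callback `flags`, the function `purify_until_stable`, the stopping rule (stop when an iteration flags no item that has not already been removed), and the toy rule are all hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical callback: flags(studied_item, anchor_items) returns True when
# the studied item is flagged as DIF against the given anchor/matching items.
DifTest = Callable[[int, List[int]], bool]


def purify_until_stable(items: Iterable[int], flags: DifTest,
                        max_iter: int = 20) -> Tuple[Set[int], int]:
    """Return (items flagged on the final iteration, number of iterations)."""
    all_items = list(items)
    removed: Set[int] = set()  # items excluded from the anchor so far
    for iteration in range(1, max_iter + 1):
        anchor = [i for i in all_items if i not in removed]
        # Every item is studied in turn; it never serves as its own anchor here.
        found = {j for j in all_items
                 if flags(j, [i for i in anchor if i != j])}
        if found <= removed:       # no further DIF items were found
            return found, iteration
        removed |= found           # drop new DIF items from later anchors
    raise RuntimeError("purification did not stabilise within max_iter")


# Toy rule (invented): items 2 and 5 always show DIF; item 6 shows DIF only
# once neither of them remains in its anchor set (a masking effect).
def toy_flags(item: int, anchor: List[int]) -> bool:
    if item in (2, 5):
        return True
    return item == 6 and not any(i in (2, 5) for i in anchor)


final_flags, n_iterations = purify_until_stable(range(1, 9), toy_flags)
print(sorted(final_flags), n_iterations)
```

The sketch excludes the studied item from its own anchor set, which suits the LR test; for the Mantel tests the studied item would normally remain in the matching variable, as noted by Zwick et al. (1993).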

Results 

Classical Item Statistics 

Summary statistics for the reference and the focal groups are presented in Table 2. Item 
statistics, means and standard deviations, and correlations between the item score and the 
item-excluded total score are given in Table 3. 
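
For readers who wish to reproduce statistics of the kind reported in Table 3, the correlation between an item score and the item-excluded total score can be computed as in the following sketch. The function names and the small response matrix are hypothetical; the data are invented and are not RTT responses.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses: List[List[int]]) -> List[float]:
    """responses[p][j] is the score of person p on item j."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Remove the studied item from the total before correlating.
        rest = [t - s for t, s in zip(totals, item)]
        result.append(pearson(item, rest))
    return result


# Invented responses of five persons to three four-option items.
toy = [[1, 1, 2], [2, 1, 2], [3, 3, 4], [4, 3, 3], [2, 2, 1]]
print([round(r, 2) for r in item_rest_correlations(toy)])
```

Removing the studied item from the total avoids the spurious inflation that arises when an item is correlated with a sum that contains it.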



Insert Tables 2 and 3 about here 



Results of the Chi-Square Test and the Area Measures 

Item parameter estimates and estimated variance terms are reported in Table 4. (MULTILOG 
does not provide estimates of the item parameter covariance terms.) These estimates 
were used to calculate the metric transformation coefficients, $A$ and $B$, required for iterative 
linking. Results for the chi-square test, $Z(S_j)$, and $Z(U_j)$ are presented in Table 5 for the 
first and final iterations, respectively. 



Insert Tables 4 and 5 about here 



On the first iteration, five DIF items were detected using $\chi_j^2$, three DIF items using 
$Z(S_j)$, and eight items using $Z(U_j)$. Two iterations were required for each of $\chi_j^2$, $Z(S_j)$, 
and $Z(U_j)$. The final iteration yielded the same sets of DIF items as the first iteration for 
the $\chi_j^2$ and $Z(S_j)$ methods and one additional item (item 26) for the $Z(U_j)$ method. 

Results of the Tests Based on Observed Scores 



Results of the Mantel test, the generalized Mantel-Haenszel test, and the SIBTEST are 
presented in Table 6 for the first and final iterations. Three iterations were required for $M_j$ 
and four for $Q_j$. The SIBTEST purification process was implemented as described by Stout 
and Roussos (1996). 

On the first iteration, 11 DIF items were detected using $M_j$ and 10 using $Q_j$. Eight 
items were identified in the Step 1 set by SIBTEST. The final iterations yielded 11, 14, and 
8 DIF items for the $M_j$, $Q_j$, and the SIBTEST, respectively. Although the number of items 
identified by $M_j$ and the SIBTEST were the same, the actual items were different. 



Insert Table 6 about here 



Likelihood Ratio Test Results 

Results for the analysis of the compact and the augmented models for studying item 1 are 
given in Table 7. The item parameter estimates and the standard errors for the compact 
model are given in the two columns to the left of the item numbers. The value of -2 log L for 
the compact model was 30250.1 (see footnote at the bottom of Table 7). The item parameter 
estimates and the standard errors for the augmented model are given to the right of those of 
the compact model. There are two sets of item parameter estimates for each studied item, 
one for the reference group and one for the focal group; those for item 1 are given in Table 7 
as an illustration. When item 1 was the studied item, 
items 2 to 40 were used as the internal anchor set. The value of -2 log L for the augmented 
model for item 1 was 30241.7 and is given in the column to the right of the item parameter 
estimates. 



Insert Table 7 about here 



For item 1, the likelihood ratio test statistic was $G_j^2 = 30250.1 - 30241.7 = 8.4$. This 
value was not significant at $\alpha = .01$. Summary results for all 40 items are presented in Table 
8. The same 11 items were significant from the first and second (i.e., final) iterations. 
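
The arithmetic of the likelihood ratio test for item 1 can be checked with a short calculation. The sketch below assumes four degrees of freedom (one $a_j$ and three $b_{jk}$ freed for the studied item), which is consistent with the critical value of 13.28 reported in Table 8, and uses the closed-form upper-tail probability of a chi-square variate with 4 df. The function names are hypothetical.

```python
from math import exp


def lr_statistic(neg2loglik_compact: float, neg2loglik_augmented: float) -> float:
    """Likelihood ratio statistic: difference of the two -2 log L values."""
    return neg2loglik_compact - neg2loglik_augmented


def chi2_sf_df4(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 4 df.

    For even df = 2m the tail is exp(-x/2) * sum_{i<m} (x/2)**i / i!,
    which for m = 2 reduces to exp(-x/2) * (1 + x/2).
    """
    return exp(-x / 2.0) * (1.0 + x / 2.0)


g2_item1 = lr_statistic(30250.1, 30241.7)   # values reported for item 1
p_item1 = chi2_sf_df4(g2_item1)
significant = p_item1 < 0.01

print(round(g2_item1, 1), round(p_item1, 3), significant)
```

The same function applied to the critical value 13.28 returns a tail probability very close to .01, which provides a check on the assumed degrees of freedom.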



Insert Table 8 about here 



Comparison of DIF Indices 



Similarity between DIF indices was determined by comparing the ranks of the values of one 
index with the ranks for a second using Spearman’s $\rho$. Values for the two test statistics of 
the area measures and the SIBTEST statistic $B_j$ were first squared and then ranked. Results 
were compared for the iterative methods from the first and final iterations only; intermediate 
results were not included. 

Correlations between first and final iterations indicate the impact of the iterative 
procedures on the magnitude of the DIF indices. Spearman’s $\rho$ values for the same DIF 
index ranged from .877 to 1.000, indicating that the iterative procedures had a relatively small 
impact on the magnitudes of the DIF indices. 
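
The rank comparison described above can be illustrated as follows. This is a sketch using only the first five items of Table 5 (first iteration), so the resulting coefficient is not one of the values in Table 9; the function names are hypothetical, and tied values receive average ranks, which is an assumption about how ties were handled.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Items 1 to 5, first iteration, from Table 5.
lord_chi2 = [3.78, 1.29, 5.76, 3.69, 1.94]
z_signed = [-1.68, -0.19, 1.16, 1.04, -1.00]

# The signed Z statistic is squared before ranking, as described in the text.
rho = spearman_rho(lord_chi2, [z * z for z in z_signed])
print(round(rho, 3))
```

Squaring the signed statistics before ranking places them on the same footing as the chi-square indices, for which only the magnitude of DIF, and not its direction, is reflected.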

There were moderate to strong relationships among IRT-based DIF indices except for 
$Z^2(U_j)$. Comparable correlations in the moderate to strong range were observed between 
the observed score-based indices. Relationships between IRT-based indices and observed 
score-based indices were also of similar magnitude except for those involving $Z^2(U_j)$. 



Insert Tables 9 and 10 about here 



Agreement between items identified as functioning differentially by each index was 
assessed by calculating $\phi$ coefficients between the sets of detected items. The $\phi$ coefficients in 
Table 10 show moderately high to very strong agreement between first and final iterations for 
the same indices, with coefficients ranging from .688 to 1.0. This suggests that the iterative 
procedures generally had small to no impact on the items identified. 

Agreement tended to be moderate to moderately high (.462 to .640) between IRT-based 
methods. Agreement was modest to moderately high (.288 to .733) between observed 
score-based methods. Between IRT- and observed score-based methods, agreement was generally 
modest to moderate (ranging from .288 to .657) except for those involving SIBTEST (ranging 
from .095 to .479). 
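
A $\phi$ coefficient between two sets of detected items can be computed from the 2 x 2 cross-classification of the 40 items (flagged or not flagged by each index). The sketch below uses the items flagged on the first iteration by Lord's chi-square and by the signed area statistic in Table 5; the function name is hypothetical, and the result is offered as an illustration rather than as a reproduction of any entry in Table 10.

```python
from math import sqrt
from typing import Set


def phi_coefficient(flags_x: Set[int], flags_y: Set[int], n_items: int) -> float:
    """Phi coefficient between two sets of flagged items out of n_items."""
    a = len(flags_x & flags_y)   # flagged by both indices
    b = len(flags_x - flags_y)   # flagged by the first index only
    c = len(flags_y - flags_x)   # flagged by the second index only
    d = n_items - a - b - c      # flagged by neither index
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a margin of the table is zero")
    return (a * d - b * c) / denom


# First-iteration flags read from Table 5 (entries marked with *).
lord_flags = {11, 24, 29, 35, 39}
signed_area_flags = {24, 29, 37}

phi = phi_coefficient(lord_flags, signed_area_flags, n_items=40)
print(round(phi, 3))
```

Because most of the 40 items are flagged by neither index, $\phi$ is driven largely by the handful of items on which the two indices agree or disagree, so a change in a single flagged item can move the coefficient noticeably.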



Discussion 



Detection and removal of DIF items on graded response tests is an important concern 
for test developers. Methods for detection of DIF in this important model are becoming 
increasingly necessary as performance-type assessments become more widely used. Selection 
of a DIF detection index, however, is often a difficult and even confusing task. This is 
especially so when DIF indices do not all identify the same items. In the present paper, 
several DIF detection indices for graded response items were examined. Four IRT-based 
measures of DIF in the graded response model were described along with three observed 
score-based DIF measures. DIF detection results for each of these indices using data from a 
test anxiety scale were then compared. 

The DIF detection methods examined all either permitted or required some kind of 
iterative or sequential removal of DIF items from the test defining the matching variable. 
The presence of DIF items has been shown to affect the quality of the common metric that 
is established as well as the quality of the DIF detection in dichotomous IRT models (Kim & 
Cohen, 1992; Shepard et al., 1984). The use of iterative or sequential methods for purification 
of the test prior to making DIF comparisons, however, did not appear to have reduced 
differences observed in the DIF items identified by these seven methods. That is, there was 
strong similarity in the items identified within each method after purification. 

There was moderate to high similarity in the magnitudes of six of the seven DIF indices, 
the exception being the unsigned area. These results are in general agreement with previous research with 
these same indices for both dichotomous models and graded response models. 

There was also overlap in the set of items identified by each of the seven measures. 
Unfortunately, the same items were not always identified by each method. This is not an 
uncommon finding and has led to the usual advice which is to not rely on results from a 
single DIF detection index. Instead, the recommendation is to use multiple DIF indices. 
Given the incongruity of agreement among the DIF indices in Table 10, this suggestion 
seems plausible. In fact, it might make some sense to select DIF detection indices which test 
for DIF in markedly different ways. In this way, one could hope for some sort of optimal 
coverage in identifying DIF items. 

Differences in the seven DIF indices can be ascribed, at least in part, to 
differences in the ways they each identify an item as functioning differentially. Inspection of 
each of the indices shows that the four IRT-based DIF indices each test for DIF in a different 
way from one another. Recall that, for the graded response IRT model, DIF was defined as 
occurring when $T_{jR}(\theta) \neq T_{jF}(\theta)$. The $\chi_j^2$ and $G_j^2$, however, both test for DIF by examining 
whether $\xi_{jR} = \xi_{jF}$, that is, whether the item parameters are equal in the reference and focal 
groups. The signed area measure, $S_j$, tests DIF as $\int [T_{jR}(\theta) - T_{jF}(\theta)] \, d\theta$, and the unsigned 
area measure, $U_j$, tests this definition in a slightly different way, $\int |T_{jR}(\theta) - T_{jF}(\theta)| \, d\theta$. 
Both of these approaches differ from the $\chi_j^2$ and $G_j^2$, and both differ from one another 
as well. Further, if the distribution of $U_j$ is not normal, the resulting DIF may be tested 
with an incorrect error term. 
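
To make the distinction between the signed and unsigned areas concrete, the sketch below approximates both integrals numerically. Several assumptions are made and should be kept in mind: $T_j(\theta)$ is taken to be the expected item score under a logistic graded response model with no scaling constant, the integration is truncated to the interval from -10 to 10 and carried out with the trapezoidal rule (the closed-form expressions are not used), and the function names are hypothetical. The example parameters are the separate-calibration estimates for item 1 in Table 4, used before linking, so the resulting areas are illustrative only.

```python
from math import exp
from typing import Sequence, Tuple

Item = Tuple[float, Sequence[float]]  # (a, [b_1, b_2, b_3])


def true_score(theta: float, a: float, bs: Sequence[float]) -> float:
    """Expected item score: sum of the boundary response curves.

    Categories are scored 0, 1, ..., K; scoring them 1, ..., K + 1 adds a
    constant that cancels in the difference between groups.
    """
    return sum(1.0 / (1.0 + exp(-a * (theta - b))) for b in bs)


def area_measures(ref: Item, foc: Item, lo: float = -10.0,
                  hi: float = 10.0, n: int = 4000) -> Tuple[float, float]:
    """Trapezoidal approximations to the signed and unsigned areas."""
    h = (hi - lo) / n
    signed = unsigned = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        diff = true_score(theta, *ref) - true_score(theta, *foc)
        signed += weight * diff * h
        unsigned += weight * abs(diff) * h
    return signed, unsigned


# Item 1, Table 4: reference and (unlinked) focal estimates.
reference_item1: Item = (1.44, [-1.43, 1.07, 2.35])
focal_item1: Item = (1.34, [-2.20, 0.21, 1.21])

signed_area, unsigned_area = area_measures(reference_item1, focal_item1)
print(round(signed_area, 3), round(unsigned_area, 3))
```

When the two true score functions cross, positive and negative differences cancel in the signed area but not in the unsigned area, which is one reason the two measures can flag different items.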

A similar point can be made about the DIF that is tested for in the three observed 
score-based methods, $M_j$, $Q_j$, and $B_j$. $M_j$ assumes an ordered set of categories, but $Q_j$ 
assumes nominal categories. When $M_j$ is significant, then the assumption of conditional 
independence of the item score and the matching variable is rejected. That is, individuals 
in the focal and reference groups with the same level on the matching variable are likely 
to differ in their average performance on the studied item. When $Q_j$ is significant, then 
individuals with the same matching variable but in different groups tend to have different 
response patterns on the studied item. These two indices differ, in other words, in the way 
they identify DIF in an item. $B_j$ is a measure of the difference in conditional probabilities of 
responding the same. 
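
The comparison of average item performance that underlies the Mantel test can be sketched as follows. The fragment computes the Mantel (1963) chi-square for ordered categories from a set of 2 x T tables, one per level of the matching variable; the function name and the small tables are hypothetical, and the code is a minimal illustration rather than the program used in this study.

```python
from typing import List, Sequence, Tuple

# One stratum: (reference counts, focal counts) over the T response categories.
Stratum = Tuple[Sequence[int], Sequence[int]]


def mantel_chi_square(strata: List[Stratum], scores: Sequence[float]) -> float:
    """Mantel chi-square (1 df) for an ordered-category item."""
    sum_f = sum_e = sum_v = 0.0
    for ref, foc in strata:
        n_r, n_f = sum(ref), sum(foc)
        n = n_r + n_f
        if n < 2 or n_r == 0 or n_f == 0:
            continue  # stratum carries no information about group differences
        col = [r + f for r, f in zip(ref, foc)]          # category totals
        s1 = sum(y * c for y, c in zip(scores, col))
        s2 = sum(y * y * c for y, c in zip(scores, col))
        sum_f += sum(y * f for y, f in zip(scores, foc))  # focal score sum
        sum_e += n_f * s1 / n                             # its expectation
        sum_v += n_r * n_f * (n * s2 - s1 ** 2) / (n ** 2 * (n - 1))
    if sum_v == 0:
        raise ValueError("variance is zero; the statistic is undefined")
    return (sum_f - sum_e) ** 2 / sum_v


# Invented counts for two levels of the matching variable, four categories.
toy_strata: List[Stratum] = [
    ([2, 5, 6, 3], [6, 5, 3, 1]),
    ([1, 3, 7, 8], [3, 6, 5, 2]),
]
m_stat = mantel_chi_square(toy_strata, scores=[1, 2, 3, 4])
print(round(m_stat, 2), m_stat > 6.63)  # 6.63 is the .01 critical value, 1 df
```

With only two response categories the statistic reduces to the Mantel-Haenszel chi-square without the continuity correction, which provides a convenient check on an implementation.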

One problem that appears to intrude on the equality of $\chi_j^2$ and $G_j^2$ results is that these 
two indices are only equivalent asymptotically. Asymptotic results are not usually obtained 
in smaller samples or with shorter tests. In addition, estimation errors are present in the 
variances used to calculate $\chi_j^2$. Further, computer programs such as MULTILOG do not 
provide the covariances needed for $\chi_j^2$ for the graded response model. 

One factor militating against use of $G_j^2$ either with or without iterative purification is 
that the LR test with iterative purification is far more labor intensive than $\chi_j^2$ and the area 
measures. Iterative linking methods for $\chi_j^2$, $Z(S_j)$, and $Z(U_j)$ require only a single calibration 
of item parameters in each group followed by a series of relinking and recalculation of DIF 
indices. The observed score-based methods, however, were simplest of all to use. The $M_j$ and 
$Q_j$ do not specifically require purification of the matching variable but it is recommended, 
and the SIBTEST does have a sequential procedure. 

The data presented in this study provide some evidence of the relationships and agreement 
among these methods. Given the importance of polytomous models such as the graded 
response model, further empirical evidence would be helpful in assisting test developers to 
select DIF detection indices. Results of this study can provide useful information about the 
relationships to expect between various DIF detection methods. 
Table 1 

Data for the Ith Level of the Matching Variable 
Table 2 

Summary Statistics for Reference (Female) and Focal (Male) Groups 

                                   Group
Statistic            Reference     Focal     Total
No. of Subjects            226       195       421
No. of Items                40        40        40
Mean                     84.33     72.11     78.67
SD                       20.66     17.62     20.23
Coefficient Alpha          .93       .93       .94
SEM                       5.35      4.80      5.15



Table 3 

Item Statistics for Reference and Focal Groups 

              Reference                Focal                  Total
Item    Mean    SD   Corr.      Mean    SD   Corr.      Mean    SD   Corr.
   1    2.50  0.93    .52       1.99  0.74    .53       2.26  0.88    .56
   2    2.32  1.10    .41       2.03  1.01    .40       2.19  1.07    .42
   3    1.43  0.84    .37       1.48  0.76    .37       1.45  0.80    .34
   4    1.66  0.92    .50       1.55  0.81    .49       1.61  0.87    .49
   5    1.92  0.96    .40       1.52  0.75    .39       1.73  0.89    .43
   6    2.36  1.12    .45       1.86  0.96    .56       2.13  1.07    .53
   7    1.54  0.85    .44       1.56  0.81    .38       1.55  0.83    .39
   8    2.94  1.01    .46       2.54  0.99    .43       2.76  1.02    .48
   9    2.01  1.12    .44       1.82  0.93    .26       1.92  1.04    .38
  10    1.20  0.60    .18       1.33  0.76    .37       1.26  0.68    .22
  11    2.97  1.04    .60       2.21  0.92    .41       2.62  1.06    .58
  12    1.87  0.95    .44       1.69  0.79    .43       1.79  0.88    .44
  13    2.66  1.05    .20       2.41  1.11    .20       2.54  1.08    .22
  14    1.55  0.86    .44       1.34  0.73    .36       1.45  0.81    .43
  15    2.60  1.01    .57       2.13  0.87    .60       2.38  0.98    .61
  16    2.81  1.05    .58       2.22  0.92    .54       2.53  1.03    .60
  17    1.60  0.93    .40       1.60  0.87    .35       1.60  0.90    .36
  18    1.38  0.76    .46       1.27  0.64    .46       1.33  0.71    .46
  19    2.04  1.02    .56       1.62  0.81    .43       1.85  0.95    .54
  20    2.77  0.99    .52       2.37  1.01    .55       2.58  1.02    .56
  21    2.47  0.90    .56       2.32  0.93    .70       2.40  0.91    .61
  22    2.35  1.09    .41       1.90  1.01    .51       2.14  1.08    .48
  23    1.80  0.95    .57       1.36  0.76    .54       1.60  0.89    .59
  24    1.54  0.87    .52       1.57  0.81    .56       1.56  0.84    .50
  25    1.75  1.02    .44       1.36  0.74    .55       1.57  0.92    .51
  26    2.71  1.08    .62       2.00  0.96    .60       2.38  1.08    .65
  27    2.54  1.02    .63       2.13  0.94    .54       2.35  1.00    .62
  28    1.61  0.95    .45       1.52  0.84    .50       1.57  0.90    .46
  29    1.77  0.91    .54       1.69  0.90    .60       1.73  0.90    .55
  30    1.85  1.05    .40       1.50  0.78    .46       1.69  0.95    .45
  31    2.10  1.10    .53       1.78  0.96    .46       1.95  1.05    .52
  32    1.38  0.80    .44       1.32  0.70    .52       1.35  0.75    .46
  33    2.33  1.08    .65       1.91  0.94    .60       2.14  1.04    .65
  34    2.34  1.01    .65       1.96  0.90    .47       2.17  0.98    .60
  35    2.58  1.01    .61       2.10  0.79    .45       2.35  0.95    .58
  36    2.31  1.06    .55       1.81  0.94    .49       2.08  1.03    .56
  37    2.51  1.07    .54       1.76  0.89    .49       2.16  1.06    .57
  38    1.56  0.84    .43       1.49  0.76    .47       1.53  0.81    .43
  39    1.98  1.05    .51       1.86  0.84    .30       1.93  0.96    .43
  40    2.73  1.08    .60       2.23  0.98    .55       2.50  1.07    .61
Table 4 

Item Parameter Estimates for Reference and Focal Groups from Separate Calibration Runs 

                             Reference                                                  Focal
Item   a_jR(s.e.)   b_1jR(s.e.)   b_2jR(s.e.)   b_3jR(s.e.)     a_jF(s.e.)   b_1jF(s.e.)   b_2jF(s.e.)   b_3jF(s.e.)
   1   1.44(0.26)   -1.43(0.25)    1.07(0.25)    2.35(0.49)     1.34(0.22)   -2.20(0.37)    0.21(0.17)    1.21(0.24)
   2   0.84(0.23)   -1.07(0.32)    1.00(0.40)    2.32(0.71)     0.92(0.19)   -1.31(0.31)    0.44(0.24)    1.52(0.42)
   3   0.87(0.28)    0.48(0.30)    2.62(0.83)    3.85(1.26)     0.70(0.21)    1.49(0.56)    2.93(0.90)    4.27(1.26)
   4   1.26(0.30)    0.15(0.19)    1.68(0.43)    2.72(0.62)     1.04(0.21)    0.13(0.19)    1.64(0.35)    2.72(0.55)
   5   0.95(0.24)    0.23(0.24)    2.29(0.63)    3.94(1.15)     0.81(0.20)   -0.70(0.28)    1.46(0.42)    2.97(0.74)
   6   1.45(0.28)   -0.51(0.16)    0.80(0.22)    1.80(0.37)     1.03(0.19)   -1.35(0.29)    0.18(0.21)    1.25(0.30)
   7   0.90(0.27)    0.19(0.26)    2.39(0.73)    3.27(0.99)     0.88(0.19)    0.50(0.25)    2.44(0.55)    3.26(0.71)
   8   1.05(0.21)   -2.30(0.44)   -0.25(0.21)    1.17(0.33)     1.11(0.21)   -2.67(0.51)   -0.82(0.21)    0.35(0.20)
   9   0.60(0.21)   -0.61(0.37)    2.27(0.89)    3.92(1.49)     0.91(0.19)   -0.40(0.22)    0.93(0.31)    1.93(0.48)
  10   1.07(0.29)    1.26(0.41)    2.07(0.61)    3.23(0.90)     0.46(0.44)    4.26(2.25)    5.97(4.46)    8.77(4.38)
  11   1.09(0.22)   -1.69(0.34)    0.59(0.24)    1.86(0.45)     1.77(0.28)   -2.00(0.28)   -0.67(0.14)    0.08(0.14)
  12   1.12(0.25)   -0.40(0.20)    1.56(0.41)    3.24(0.81)     0.96(0.20)   -0.46(0.22)    1.23(0.32)    2.84(0.65)
  13   0.48(0.18)   -2.55(1.01)    0.10(0.45)    2.45(1.11)     0.38(0.17)   -4.58(2.97)   -0.79(0.59)    2.44(1.23)
  14   1.06(0.32)    1.06(0.38)    2.43(0.73)    3.25(0.98)     0.97(0.23)    0.52(0.24)    2.02(0.48)    3.22(0.75)
  15   1.70(0.29)   -1.38(0.20)    0.65(0.19)    1.49(0.29)     1.50(0.22)   -1.87(0.26)   -0.11(0.15)    0.82(0.19)
  16   1.40(0.27)   -1.56(0.26)    0.48(0.19)    1.46(0.31)     1.61(0.24)   -1.88(0.26)   -0.47(0.14)    0.38(0.15)
  17   0.86(0.27)    0.26(0.28)    1.96(0.68)    3.49(1.24)     0.93(0.21)    0.56(0.25)    1.83(0.48)    3.01(0.74)
  18   1.36(0.35)    1.04(0.31)    2.16(0.54)    3.15(0.72)     1.14(0.25)    1.03(0.28)    2.33(0.51)    3.20(0.72)
  19   1.08(0.26)   -0.13(0.20)    1.88(0.48)    2.93(0.77)     1.32(0.23)   -0.74(0.18)    0.78(0.22)    1.75(0.31)
  20   1.44(0.26)   -1.51(0.24)    0.09(0.17)    1.05(0.26)     1.28(0.21)   -2.36(0.37)   -0.46(0.17)    0.72(0.20)
  21   2.08(0.33)   -1.45(0.17)    0.13(0.14)    1.08(0.20)     1.39(0.21)   -2.03(0.31)    0.08(0.16)    1.42(0.26)
  22   1.32(0.26)   -0.51(0.18)    0.81(0.25)    1.67(0.38)     0.88(0.19)   -1.52(0.36)    0.33(0.25)    1.55(0.41)
  23   1.59(0.33)    0.71(0.22)    1.72(0.36)    2.24(0.46)     1.55(0.22)   -0.22(0.14)    0.91(0.18)    2.05(0.32)
  24   1.55(0.31)   -0.03(0.15)    1.55(0.34)    2.18(0.44)     1.12(0.25)    0.50(0.21)    1.80(0.39)    2.82(0.58)
  25   1.71(0.33)    0.65(0.20)    1.65(0.33)    2.35(0.49)     1.01(0.21)    0.07(0.19)    1.33(0.32)    2.27(0.48)
  26   1.69(0.29)   -0.89(0.16)    0.63(0.19)    1.36(0.28)     1.90(0.26)   -1.73(0.23)   -0.18(0.13)    0.36(0.14)
  27   1.50(0.24)   -1.17(0.20)    0.39(0.19)    1.59(0.32)     1.70(0.23)   -1.54(0.20)   -0.18(0.13)    0.93(0.18)
  28   1.41(0.31)    0.33(0.18)    1.33(0.33)    2.53(0.55)     0.94(0.22)    0.50(0.23)    1.78(0.44)    2.81(0.65)
  29   1.81(0.26)   -0.16(0.13)    0.91(0.19)    1.87(0.33)     1.22(0.22)   -0.24(0.17)    1.25(0.26)    2.57(0.46)
  30   1.17(0.28)    0.24(0.21)    2.25(0.54)    2.68(0.66)     0.81(0.18)   -0.15(0.24)    1.35(0.38)    2.50(0.59)
  31   1.08(0.23)   -0.34(0.20)    1.27(0.34)    2.16(0.49)     1.19(0.20)   -0.71(0.20)    0.67(0.21)    1.43(0.30)
  32   1.69(0.36)    0.77(0.20)    1.76(0.36)    2.43(0.51)     1.08(0.24)    1.26(0.30)    2.03(0.43)    3.24(0.70)
  33   1.86(0.31)   -0.63(0.14)    0.68(0.17)    1.52(0.30)     1.86(0.24)   -1.04(0.15)    0.20(0.13)    0.95(0.16)
  34   1.18(0.23)   -1.01(0.23)    0.90(0.27)    2.20(0.48)     1.79(0.24)   -1.24(0.17)    0.21(0.13)    1.12(0.19)
  35   1.14(0.26)   -1.76(0.35)    0.93(0.28)    2.53(0.60)     1.73(0.24)   -1.68(0.22)   -0.11(0.13)    0.77(0.17)
  36   1.20(0.24)   -0.40(0.18)    0.95(0.28)    2.34(0.55)     1.32(0.22)   -1.14(0.21)    0.23(0.16)    1.31(0.26)
  37   1.16(0.26)   -0.46(0.19)    1.40(0.36)    2.32(0.54)     1.38(0.23)   -1.57(0.25)    0.02(0.16)    0.39(0.20)
  38   1.22(0.26)    0.28(0.19)    1.71(0.40)    3.19(0.75)     0.78(0.19)    0.50(0.29)    2.48(0.62)    4.00(1.01)
  39   0.71(0.20)   -1.14(0.39)    1.94(0.63)    4.07(1.27)     1.17(0.20)   -0.44(0.19)    0.74(0.22)    1.93(0.38)
  40   1.55(0.27)   -1.28(0.21)    0.29(0.17)    1.29(0.26)     1.69(0.25)   -1.71(0.23)   -0.32(0.14)    0.38(0.15)
Table 5 

Lord's $\chi_j^2$, $Z(S_j)$, and $Z(U_j)$ from the First and Final Iterations 

          Lord's chi-square         Z(S_j)               Z(U_j)
Item      First     Final       First     Final      First     Final
   1       3.78      3.46       -1.68     -1.63       1.35      1.47
   2       1.29      1.21       -0.19     -0.14       0.07      0.23
   3       5.76      5.98        1.16      1.22      -0.06     -0.12
   4       3.69      4.03        1.04      1.13       0.14     -0.03
   5       1.94      1.79       -1.00     -0.95       0.03      0.12
   6       1.90      1.84       -0.89     -0.82       0.80      0.76
   7       4.95      5.24        0.86      0.93      -0.01     -0.12
   8       1.44      1.29       -0.06     -0.06       0.21      0.41
   9       6.86      6.78       -1.00     -0.96       0.71      0.80
  10       5.81      5.92        1.99      2.01      -0.24     -0.23
  11      20.27*    19.65*      -2.44     -2.42       4.95*     5.12*
  12       2.74      2.87        0.33      0.40      -0.82     -0.82
  13       0.49      0.50       -0.36     -0.36      -0.91     -0.92
  14       0.08      0.07        0.01      0.08      -1.20     -1.16
  15       2.07      1.83       -0.71     -0.66       0.13      0.40
  16       8.51      7.99       -1.35     -1.32       2.67*     2.91*
  17       4.46      4.67        0.43      0.49      -0.32     -0.33
  18       1.54      1.77        0.89      0.98      -0.13     -0.29
  19       3.75      3.47       -1.57     -1.50       1.24      1.43
  20       0.30      0.36       -0.18     -0.15      -0.56     -0.77
  21      11.74     12.55        2.43      2.47       3.30*     3.08*
  22       2.54      2.62       -0.24     -0.18       1.30      1.16
  23       3.91      3.60       -0.96     -0.86       0.24      0.50
  24      19.70*    20.54*       2.79*     2.88*      2.78*     2.60*
  25       2.60      2.71        0.22      0.32       0.85      0.74
  26       6.52      6.00       -2.30     -2.24       2.57      2.76*
  27       2.37      2.15       -0.15     -0.10       1.03      1.28
  28       8.65      9.22        1.94      2.03       1.39      1.23
  29      15.23*    16.20*       3.28*     3.38*      3.82*     3.59*
  30       1.52      1.54       -0.28     -0.20      -0.25     -0.23
  31       1.46      1.34       -0.48     -0.40       0.41      0.63
  32      10.75     11.42        2.41      2.51       1.99      1.82
  33       1.06      1.01       -0.02      0.07       0.04      0.29
  34       8.85      8.47       -0.84     -0.78       2.89*     3.10*
  35      14.91*    14.39*      -1.47     -1.44       3.96*     4.13*
  36       2.67      2.39       -1.46     -1.40       1.10      1.31
  37      12.14     11.65       -3.20*    -3.15*      3.73*     3.87*
  38       8.36      8.84        1.90      1.97       0.79      0.70
  39      15.05*    14.91*      -0.89     -0.84       1.70      1.80
  40       3.58      3.25       -0.76     -0.72       1.50      1.77

*p < .01. The critical values are $\chi_4^2 = 13.28$ and $Z = \pm 2.58$. 
Table 6 

Mantel $M_j$, GMH $Q_j$, and SIBTEST $\hat{\beta}_j$ (s.e.) from the First and Final Iterations 

              M_j                  Q_j                   beta_j (s.e.)
Item     First     Final      First     Final        First            Final
   1      6.63*     8.19*      7.23      6.70     -.311(.086)*     -.278(.083)*
   2      0.23      0.49       2.20      1.04     -.010(.130)      -.010(.130)
   3     14.93*     5.98*     20.40*    20.40*     .227(.070)*      .230(.090)
   4      1.64      5.73       1.74      5.13      .123(.090)       .123(.090)
   5      3.35      3.61       5.05      3.62     -.209(.092)      -.209(.092)
   6      2.13      2.49       7.58      2.73     -.112(.118)      -.112(.118)
   7     13.14*     5.24*     14.02*    14.02*     .220(.090)       .220(.090)
   8      0.36      0.59       3.40      0.84     -.078(.115)      -.078(.115)
   9      0.02      1.26       5.49     11.40*     .239(.134)       .239(.134)
  10      7.47*     5.92*      7.55     11.97*     .143(.071)       .143(.071)
  11     12.80*    19.65*     13.53*    13.53*    -.422(.105)*     -.570(.098)*
  12      2.24      0.95       3.81      4.59      .052(.085)       .052(.085)
  13      0.00      0.47       6.42      3.26      .002(.123)       .002(.123)
  14      0.13      0.02       0.97      2.52     -.006(.084)      -.006(.084)
  15      1.42      0.52       3.12      2.99     -.097(.110)      -.097(.110)
  16      7.53*     7.99*      9.84      5.91     -.360(.111)*     -.377(.095)*
  17      5.97      6.43       7.37     12.15*     .178(.101)       .178(.101)
  18      1.08      1.03       1.74      2.91     -.055(.062)      -.055(.062)
  19      1.11      1.08       1.24      0.43     -.013(.102)      -.013(.102)
  20      0.22      0.06       0.88      0.84     -.066(.107)      -.066(.107)
  21     10.88*    12.55*     12.55*    12.55*     .157(.094)       .157(.094)
  22      0.54      2.68       1.97      3.24     -.248(.117)      -.248(.117)
  23      6.03      4.47      17.13*    17.13*    -.217(.086)      -.292(.077)*
  24     13.67*    20.54*     17.16*    17.16*     .208(.081)       .247(.080)*
  25      0.84      1.62       2.63      2.18     -.205(.086)      -.205(.086)
  26     12.03*     6.00*     13.64*    13.64*    -.424(.100)*     [illegible]*
  27      0.67      0.12       1.98      0.43     -.081(.113)      -.081(.113)
  28      3.51      2.64       4.92      8.18      .104(.085)       .104(.085)
  29     13.34*    16.20*     15.68*    15.68*     .159(.082)       .159(.082)
  30      1.60      1.55       4.17      5.45     -.289(.092)*     -.276(.088)*
  31      0.33      0.39       0.41      0.94      .095(.115)       .095(.115)
  32      5.67      5.94      10.18     18.77*     .069(.074)       .069(.074)
  33      0.14      0.82       0.74      2.61     -.040(.105)      -.040(.105)
  34      0.49      0.21       0.73      0.31      .030(.107)       .030(.107)
  35      2.41      1.02       4.35      6.07     -.012(.103)      -.208(.086)
  36      2.71      1.45       4.08      3.36     -.170(.114)      -.170(.114)
  37     25.20*    11.65*     25.41*    25.41*    -.503(.109)*     -.556(.099)*
  38      5.13      2.61       5.88      6.35      .187(.083)       .187(.083)
  39      1.99      2.67      17.13*    17.13*     .306(.094)*      .141(.096)
  40      0.57      0.94       3.94      4.01     -.030(.109)      -.030(.109)

*p < .01. The critical values are $\chi_1^2 = 6.63$ for $M_j$ and $\chi_3^2 = 11.34$ for $Q_j$. 
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Table 7 

Item Parameter Estimates from the Compact and Augmented Models and the Likelihood Ratio Statistic Gj for Item 1 



Augmented Model 

Compact Model a Reference/ Anchor Item Focal 



Item aj b\j b 2] hj fl jR frijR hjR b^jR aj f frljF hjF ^3jF AF(s.e.) -21ogL G 2 . 



1 


1.56 


-1.77 


0.41 


1.35 


1.56 


-1.69 


0.62 


1.81 


1.43 


-1.98 030 L24 — 0.05(.10) 


30241.7 8.4 


2 


0.94 


-1.26 


0.52 


1.63 


0.94 


-1.26 


0.52 


1.63 








3 


0.64 


1.07 


3.20 


4.73 


0.64 


1.06 


3.18 


4.70 








4 


1.13 


0.04 


1.54 


2.59 


1.14 


0.03 


1.54 


2.59 








5 


0.98 


-0.34 


1.50 


2.83 


0.98 


-0.34 


1.50 


2.83 








6 


1.33 


-0.95 


0.33 


1.24 


1.32 


-0.96 


0.33 


1.25 








7 


0.79 


0.30 


2.57 


3.51 


0.80 


0.30 


2.56 


3.49 








8 


1.19 


-2.42 


-0.64 


0.52 


1.19 


-2.43 


-0.65 


0.52 








9 


0.80 


-0.60 


1.21 


2.36 


0.80 


-0.60 


1.20 


2.35 








10 


0.56 


2.83 


4.22 


6.39 


0.56 


2.81 


4.18 


6.33 








  11   1.66  -1.78  -0.31   0.45   1.65  -1.78  -0.31   0.45
  12   1.01  -0.54   1.29   2.90   1.01  -0.54   1.28   2.89
  13   0.48  -3.24  -0.41   2.08   0.48  -3.24  -0.41   2.08
  14   1.08   0.61   1.96   2.94   1.08   0.61   1.96   2.94
  15   1.73  -1.64   0.10   0.91   1.73  -1.64   0.10   0.91
  16   1.69  -1.72  -0.18   0.61   1.68  -1.73  -0.18   0.61
  17   0.81   0.37   1.94   3.38   0.81   0.37   1.93   3.37
  18   1.18   0.97   2.22   3.14   1.18   0.97   2.21   3.13
  19   1.36  -0.56   0.97   1.87   1.36  -0.56   0.97   1.87
  20   1.45  -1.92  -0.29   0.72   1.44  -1.92  -0.29   0.72
  21   1.62  -1.83   0.01   1.16   1.62  -1.83   0.01   1.16
  22   1.16  -1.01   0.43   1.38   1.16  -1.01   0.43   1.38
  23   1.75   0.06   1.02   1.88   1.74   0.06   1.02   1.88
  24   1.10   0.16   1.77   2.69   1.11   0.16   1.76   2.68
  25   1.38   0.24   1.25   2.00   1.38   0.24   1.25   2.00
  26   2.04  -1.32   0.03   0.56   2.03  -1.32   0.03   0.56
  27   1.71  -1.41  -0.05   1.02   1.71  -1.41  -0.05   1.02
  28   1.08   0.33   1.50   2.61   1.08   0.33   1.49   2.60
  29   1.37  -0.30   1.03   2.20   1.38  -0.30   1.02   2.19
  30   1.03  -0.06   1.46   2.25   1.03  -0.06   1.46   2.25
  31   1.22  -0.63   0.76   1.51   1.22  -0.63   0.76   1.51
  32   1.22   0.96   1.87   2.87   1.23   0.96   1.86   2.85
  33   1.98  -0.92   0.28   1.01   1.98  -0.92   0.28   1.01
  34   1.58  -1.20   0.32   1.31   1.59  -1.20   0.32   1.31
  35   1.60  -1.71   0.13   1.09   1.60  -1.71   0.13   1.09
  36   1.39  -0.85   0.38   1.45   1.39  -0.85   0.38   1.45
  37   1.50  -1.05   0.36   1.11   1.50  -1.05   0.36   1.11
  38   0.87   0.34   2.16   3.73   0.88   0.33   2.14   3.71
  39   0.91  -0.84   1.03   2.54   0.92  -0.84   1.03   2.53
  40   1.78  -1.53  -0.17   0.57   1.77  -1.53  -0.17   0.58

a The compact model yielded -2 log L = 30250.1.

Table 8

Likelihood Ratio Statistic G²_j from the First and Final Iterations

           Iteration
Item    First    Final
   1      8.4      5.9
   2      1.2      1.1
   3     22.2*    22.2*
   4      6.1      6.2
   5      6.0      5.3
   6      6.9      5.6
   7      9.8      9.7
   8      1.5      1.1
   9     10.6     10.5
  10     17.6*    17.6*
  11     28.2*    28.2*
  12      6.6      7.1
  13      2.8      2.5
  14      0.6      0.8
  15      5.6      4.2
  16      9.3      8.0
  17     11.4     12.0
  18      2.8      2.9
  19      2.9      2.5
  20      2.3      2.0
  21     17.2*    17.2*
  22      5.7      4.2
  23     10.8      9.8
  24     29.8*    29.8*
  25     10.4      8.2
  26     11.3      9.2
  27      0.4      0.3
  28     10.4     11.8
  29     15.5*    15.5*
  30     16.0*    16.0*
  31      0.8      0.7
  32     16.4*    16.4*
  33      0.9      1.2
  34      4.3      2.4
  35     15.7*    15.7*
  36      3.5      3.0
  37     21.3*    21.3*
  38      7.6      8.2
  39     26.5*    26.5*
  40      4.0      3.1

*p < .01 with χ²(4) = 13.28.
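
The flagging rule in the footnote to Table 8 can be illustrated with a short Python sketch. It assumes the statistic is referred to a chi-square distribution with 4 degrees of freedom, which is consistent with the quoted critical value of 13.28 at the .01 level. The names G2_FINAL, chi2_sf_df4, and flag_items are illustrative and do not come from the paper; the values are the final-iteration statistics transcribed from Table 8.

```python
import math

# Final-iteration likelihood ratio statistics transcribed from Table 8
# (item number -> statistic).
G2_FINAL = {
    1: 5.9, 2: 1.1, 3: 22.2, 4: 6.2, 5: 5.3, 6: 5.6, 7: 9.7, 8: 1.1,
    9: 10.5, 10: 17.6, 11: 28.2, 12: 7.1, 13: 2.5, 14: 0.8, 15: 4.2, 16: 8.0,
    17: 12.0, 18: 2.9, 19: 2.5, 20: 2.0, 21: 17.2, 22: 4.2, 23: 9.8, 24: 29.8,
    25: 8.2, 26: 9.2, 27: 0.3, 28: 11.8, 29: 15.5, 30: 16.0, 31: 0.7, 32: 16.4,
    33: 1.2, 34: 2.4, 35: 15.7, 36: 3.0, 37: 21.3, 38: 8.2, 39: 26.5, 40: 3.1,
}

CRITICAL_VALUE = 13.28  # critical value quoted in the Table 8 footnote


def chi2_sf_df4(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 4 df.

    For 4 degrees of freedom (an assumption here) the survival function
    has the closed form exp(-x/2) * (1 + x/2).
    """
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def flag_items(stats, critical=CRITICAL_VALUE):
    """Return the sorted item numbers whose statistic exceeds the critical value."""
    return sorted(item for item, g2 in stats.items() if g2 > critical)


flagged = flag_items(G2_FINAL)
print("Flagged items:", flagged)
print("Tail probability at the critical value:", chi2_sf_df4(CRITICAL_VALUE))
```

The flagged list should match the items marked with an asterisk in the Final column of Table 8.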

Table 9

Spearman's Rho Coefficients Among DIF Indices

DIF Index    Iteration  Lord's χ²_j   Z²(S_j)     Z²(U_j)      G²_j        M_j         Q_j         B_j
                        First Final First Final First Final First Final First Final First Final First Final
Lord's χ²_j  First
             Final       .996
Z²(S_j)      First       .813  .809
             Final       .813  .810  .994
Z²(U_j)      First       .647  .634  .526  .518
             Final       .603  .580  .483  .477  .972
G²_j         First       .804  .808  .695  .686  .387  .293
             Final       .792  .801  .692  .689  .345  .249 1.000
M_j          First       .718  .713  .781  .785  .298  .250  .768  .761
             Final       .707  .713  .768  .762  .325  .244  .808  .797  .877
Q_j          First       .665  .662  .658  .640  .288  .262  .821  .813  .813  .790
             Final       .728  .731  .698  .700  .268  .214  .870  .890  .798  .815  .896
B_j          First       .517  .511  .514  .499  .147  .092  .709  .681  .720  .795  .700  .687
             Final       .562  .554  .566  .551  .213  .156  .735  .708  .757  .779  .680  .683  .904

Table 10

Phi Coefficients for Agreement Among DIF Indices

DIF Index    Iteration  Lord's χ²_j   Z(S_j)      Z(U_j)       G²_j        M_j         Q_j         B_j
                        First Final First Final First Final First Final First Final First Final First Final
Lord's χ²_j  First
             Final      1.000
Z(S_j)       First       .640  .640
             Final       .640  .640 1.000
Z(U_j)       First       .569  .569  .569  .569
             Final       .528  .528  .528  .528  .928
G²_j         First       .462  .462  .462  .462  .532  .473
             Final       .462  .462  .462  .462  .532  .473 1.000
M_j          First       .493  .493  .493  .493  .577  .657  .550  .550
             Final       .462  .462  .462  .462  .532  .607  .498  .498  .937
Q_j          First       .493  .493  .493  .493  .433  .518  .550  .550  .733  .679
             Final       .388  .388  .388  .388  .288  .358  .605  .605  .666  .605  .787
B_j          First       .095  .095  .095  .095  .219  .329  .392  .392  .433  .532  .433  .288
             Final       .332  .332  .332  .332  .375  .479  .252  .252  .433  .532  .433  .288  .688
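
As an illustration of how an agreement coefficient of the kind reported in Table 10 can be obtained, the following Python sketch computes a phi coefficient from two sets of flagged items. It assumes that phi is computed on the dichotomous flagged versus not-flagged decision for each of the 40 items; the paper's exact computation is not shown here, and the function and variable names are hypothetical. The two sets are the items marked with an asterisk in the First and Final columns of Table 8.

```python
import math


def phi_coefficient(flags_a, flags_b):
    """Phi coefficient for two equal-length sequences of boolean flags."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(flags_a, flags_b):
        if a and b:
            n11 += 1      # flagged by both indices
        elif a:
            n10 += 1      # flagged by the first index only
        elif b:
            n01 += 1      # flagged by the second index only
        else:
            n00 += 1      # flagged by neither index
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when a margin of the 2x2 table is empty
    return (n11 * n00 - n10 * n01) / denom


# Items marked with an asterisk in Table 8 (first and final iterations).
FLAGGED_FIRST = {3, 10, 11, 21, 24, 29, 30, 32, 35, 37, 39}
FLAGGED_FINAL = {3, 10, 11, 21, 24, 29, 30, 32, 35, 37, 39}

items = range(1, 41)
first = [i in FLAGGED_FIRST for i in items]
final = [i in FLAGGED_FINAL for i in items]

phi = phi_coefficient(first, final)
print("phi (first vs. final iteration):", phi)
```

Because the same items are flagged in both iterations, the two flag vectors agree perfectly, which is in line with the 1.000 entry for the likelihood ratio statistic (First versus Final) in Table 10.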